Check if string consists only of valid ISO 8859-1 characters - c#

How to check if a string consists only of chars, which can be successfully encoded in ISO 8859-1? Or in other words - how to find "illegal"/"not ISO 8859-1 compatible" chars in a string?

Try this:
private static bool IsValidISO(string input)
{
byte[] bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(input);
String result = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
return String.Equals(input, result);
}
This answer is based on an answer of this Java question (my code is the C# equivalent):
http://www.velocityreviews.com/forums/t137810-checking-whether-a-string-contains-only-iso-8859-1-chars.html

You could setup an array or list of valid characters and then iterate through your string to check if each of them exists in your list of valid characters. The list can be created by adding all valid latin-1 characters to it.

I came up with this idea. Might this be possible?
private static bool IsValidISO(string input)
{
foreach (char c in input)
{
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] isoBytes = iso.GetBytes(c.ToString());
byte[] utfBytes = Encoding.Convert(iso, utf8, isoBytes);
string convertedC = utf8.GetString(utfBytes);
if (c != '?' && convertedC == "?")
return false;
}
return true;
}

Related

Compare Windows-1252 string to UTF-8 string

my goal is to convert a .NET string (Unicode) into Windows-1252 and - if necessary - store the original UTF-8 string in a Base64 entity.
For example, the string "DJ Doena" converted to 1252 is still "DJ Doena".
However if you convert the Japanese kanjii for tree (木) into 1251 you end up with a question mark.
These are my test strings:
String doena = "DJ Doena";
String umlaut = "äöüßéèâ";
String allIn = "< ä ß á â & 木 >";
This is how I convert the string in the first place:
using (MemoryStream ms = new MemoryStream())
{
using (StreamWriter sw = new StreamWriter(ms, Encoding.UTF8))
{
sw.Write(decoded);
sw.Flush();
ms.Seek(0, SeekOrigin.Begin);
using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding(1252)))
{
encoded = sr.ReadToEnd();
}
}
}
Problem is, while debugging string comparison claims that both are indeed identical, so a simple == or .Equals() doesn't suffice.
This is how I try to find out if I need base64 and produce it:
private static String GetBase64Alternate(String utf8Text, String windows1252Text)
{
Byte[] utf8Bytes;
Byte[] windows1252Bytes;
String base64;
utf8Bytes = Encoding.UTF8.GetBytes(utf8Text);
windows1252Bytes = Encoding.GetEncoding(1252).GetBytes(windows1252Text);
base64 = null;
if (utf8Bytes.Length != windows1252Bytes.Length)
{
base64 = Convert.ToBase64String(utf8Bytes);
}
else
{
for(Int32 i = 0; i < utf8Bytes.Length; i++)
{
if(utf8Bytes[i] != windows1252Bytes[i])
{
base64 = Convert.ToBase64String(utf8Bytes);
break;
}
}
}
return (base64);
}
The first string doena is completely identical and doesn't produce a base64 result
Console.WriteLine(String.Format("{0} / {1}", windows1252Text, base64Text));
results in
DJ Doena /
But the second string umlauts already has twice the bytes in UTF-8 than in 1252 and thus produces an Base64 string even though it does not appear to be necessary:
äöüßéèâ / w6TDtsO8w5/DqcOow6I=
And the third one does what it's supposed to do (no more "木" but a "?", thus base64 needed):
< ä ß á â & ? > / PCDDpCDDnyDDoSDDoiAmIOacqCA+
Any clues how my Base64 getter could be enhanced a) for performance b) for better results?
Thank you in advance. :-)
I'm not sure I completely understood the question. But I tried. :) If I do understand correctly, this code does what you want:
static void Main(string[] args)
{
string[] testStrings = { "DJ Doena", "äöüßéèâ", "< ä ß á â & 木 >" };
foreach (string text in testStrings)
{
Console.WriteLine(ReencodeText(text));
}
}
private static string ReencodeText(string text)
{
Encoding encoding = Encoding.GetEncoding(1252);
string text1252 = encoding.GetString(encoding.GetBytes(text));
return text.Equals(text1252, StringComparison.Ordinal) ?
text : Convert.ToBase64String(Encoding.UTF8.GetBytes(text));
}
I.e. it encodes the text to Windows-1252, then decodes back to a string object, which it then compares with the original. If the comparison succeeds, it returns the original string, otherwise it encodes it to UTF8, and then to base64.
It produces the following output:
DJ Doena
äöüßéèâ
PCDDpCDDnyDDoSDDoiAmIOacqCA+
In other words, the first two strings are left intact, while the third is encoded as base64.
In your first code you are encoding the string using one encoding, then decoding it using a different encoding. That doesn't give you any reliable result at all; it's the equivalent of writing out a number in octal, then reading it as if it was in decimal. It seems to work just fine for numbers up to 7, but after that you get useless results.
The problem with the GetBase64Alternate method is that it's encoding a string to two different encodings, and assumes that the first encoding doesn't support some of the characters if the second encoding resulted in a different set of bytes.
Comparing the byte sequences doesn't tell you whether any of the encodings failed. The sequences will be different if it failed, but it will also be different if there are any characters that are encoded differently between the encodings.
What you want to do is to determine if the encoding actually worked for all characters. You can do that by creating an Encoding instance with a fallback for unsupported characters. There is an EncoderExceptionFallback class that you can use for that, which throws an EncoderFallbackException if it's called.
This code will try use the Windows-1252 encoding on a string, and sets the ok variable to false if the encoding doesn't support all characters in the string:
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
ok = false;
}
As you are not actually going to used the encoded result for anything, you can use the GetByteCount method. It will check how all characters would be encoded without producing the encoded result.
Used in your method it would be:
private static String GetBase64Alternate(string text) {
Encoding e = Encoding.GetEncoding(1252, new EncoderExceptionFallback(), new DecoderExceptionFallback());
bool ok = true;
try {
e.GetByteCount(allIn);
} catch (EncoderFallbackException) {
ok = false;
}
return ok ? null : Convert.ToBase64(Encoding.UTF8.GetBytes(text));
}

How can i convert a string of characters into binary string and back again?

I need to convert a string into it's binary equivilent and keep it in a string. Then return it back into it's ASCII equivalent.
You can encode a string into a byte-wise representation by using an Encoding, e.g. UTF-8:
var str = "Out of cheese error";
var bytes = Encoding.UTF8.GetBytes(str);
To get back a .NET string object:
var strAgain = Encoding.UTF8.GetString(bytes);
// str == strAgain
You seem to want the representation as a series of '1' and '0' characters; I'm not sure why you do, but that's possible too:
var binStr = string.Join("", bytes.Select(b => Convert.ToString(b, 2)));
Encodings take an abstract string (in the sense that they're an opaque representation of a series of Unicode code points), and map them into a concrete series of bytes. The bytes are meaningless (again, because they're opaque) without the encoding. But, with the encoding, they can be turned back into a string.
You seem to be mixing up "ASCII" with strings; ASCII is simply an encoding that deals only with code-points up to 128. If you have a string containing an 'é', for example, it has no ASCII representation, and so most definitely cannot be represented using a series of ASCII bytes, even though it can exist peacefully in a .NET string object.
See this article by Joel Spolsky for further reading.
You can use these functions for converting to binary and restore it back :
public static string BinaryToString(string data)
{
List<Byte> byteList = new List<Byte>();
for (int i = 0; i < data.Length; i += 8)
{
byteList.Add(Convert.ToByte(data.Substring(i, 8), 2));
}
return Encoding.ASCII.GetString(byteList.ToArray());
}
and for converting string to binary :
public static string StringToBinary(string data)
{
StringBuilder sb = new StringBuilder();
foreach (char c in data.ToCharArray())
{
sb.Append(Convert.ToString(c, 2).PadLeft(8, '0'));
}
return sb.ToString();
}
Hope Helps You.
First convert the string into bytes, as described in my comment and in Cameron's answer; then iterate, convert each byte into an 8-digit binary number (possibly with Convert.ToString, padding appropriately), then concatenate. For the reverse direction, split by 8 characters, run through Convert.ToInt16, build up a byte array, then convert back to a string with GetString.

Convert UTF-8 to Chinese Simplified (GB2312)

Is there a way to convert UTF-8 string to Chinese Simplified (GB2312) in C#. Any help is greatly appreciated.
Regards
Jyothish George
The first thing to be aware of is that there's no such thing as a "UTF-8 string" in .NET. All strings in .NET are effectively UTF-16. However, .NET provides the Encoding class to allow you to decode binary data into strings, and re-encode it later.
Encoding.Convert can convert a byte array representing text encoded with one encoding into a byte array with the same text encoded with a different encoding. Is that what you want?
Alternatively, if you already have a string, you can use:
byte[] bytes = Encoding.GetEncoding("gb2312").GetBytes(text);
If you can provide more information, that would be helpful.
Try this;
public string GB2312ToUtf8(string gb2312String)
{
Encoding fromEncoding = Encoding.GetEncoding("gb2312");
Encoding toEncoding = Encoding.UTF8;
return EncodingConvert(gb2312String, fromEncoding, toEncoding);
}
public string Utf8ToGB2312(string utf8String)
{
Encoding fromEncoding = Encoding.UTF8;
Encoding toEncoding = Encoding.GetEncoding("gb2312");
return EncodingConvert(utf8String, fromEncoding, toEncoding);
}
public string EncodingConvert(string fromString, Encoding fromEncoding, Encoding toEncoding)
{
byte[] fromBytes = fromEncoding.GetBytes(fromString);
byte[] toBytes = Encoding.Convert(fromEncoding, toEncoding, fromBytes);
string toString = toEncoding.GetString(toBytes);
return toString;
}
source here

C# and utf8_decode

Is there a C# utf8_decode equivalent?
Use the Encoding class.
For example:
byte[] bytes = something;
string str = Encoding.UTF8.GetString(bytes);
Yes. You can use the System.Text.Encoding class to convert the encoding.
string source = "Déjà vu";
Encoding unicode = Encoding.Unicode;
// iso-8859-1 <- codepage 28591
Encoding latin1 = Encoding.GetEncoding(28591);
Byte[] result = Encoding.Convert(unicode, latin1, unicode.GetBytes(s));
// result contains the byte sequence for the latin1 encoded string
edit: or simply
string source = "Déjà vu";
Byte[] latin1 = Encoding.GetEncoding(28591).GetBytes(source);
string (System.String) is always unicode encoded, i.e. if you convert the byte sequence back to string (Encoding.GetString()) your data will again be stored as utf-16 codepoints again.
If your input is a string here is a method that would probably work (assuming your from wester europe :)
public string Utf8Decode(string inputDate)
{
return Encoding.GetEncoding("iso-8859-1").GetString(Encoding.UTF8.GetBytes(inputDate));
}
Of course, if the current encoding of the inputData is not latin1, change the "iso-8859-1" to the correct encoding.
I tried to make this implementation on Xamarin C#.
The code below worked for me:
public static string Utf8Encode(string inputDate)
{
byte[] bytes = Encoding.UTF8.GetBytes(inputDate);
return Encoding.GetEncoding("iso-8859-1").GetString(bytes,0, bytes.Length);
}
public static string Utf8Decode(string inputDate)
{
byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(inputDate);
return Encoding.UTF8.GetString(bytes, 0, bytes.Length);
}

How to convert an MD5 hash to a string and use it as a file name

I am taking the MD5 hash of an image file and I want to use the hash as a filename.
How do I convert the hash to a string that is valid filename?
EDIT: toString() just gives "System.Byte[]"
How about this:
string filename = BitConverter.ToString(yourMD5ByteArray);
If you prefer a shorter filename without hyphens then you can just use:
string filename =
BitConverter.ToString(yourMD5ByteArray).Replace("-", string.Empty);
System.Convert.ToBase64String
As a commenter pointed out -- normal base 64 encoding can contain a '/' character, which obivously will be a problem with filenames. However, there are other characters that are usable, such as an underscore - just replace all the '/' with an underscore.
string filename = Convert.ToBase64String(md5HashBytes).Replace("/","_");
Try this:
Guid guid = new Guid(md5HashBytes);
string hashString = guid.ToString("N");
// format: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
string hashString = guid.ToString("D");
// format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
string hashString = guid.ToString("B");
// format: {xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
string hashString = guid.ToString("P");
// format: (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
This is probably the safest for file names. You always get a hex string and never worry about / or +, etc.
byte[] data = md5Hash.ComputeHash(Encoding.UTF8.GetBytes(inputString));
StringBuilder sBuilder = new StringBuilder();
for (int i = 0; i < data.Length; i++)
{
sBuilder.Append(data[i].ToString("x2"));
}
string hashed = sBuilder.ToString();
Try this:
string Hash = Convert.ToBase64String(MD5.Create().ComputeHash(Encoding.UTF8.GetBytes("sample")));
//input "sample" returns = Xo/5v1W6NQgZnSLphBKb5g==
or
string Hash = BitConverter.ToString(MD5.Create().ComputeHash(Encoding.UTF8.GetBytes("sample")));
//input "sample" returns = 5E-8F-F9-BF-55-BA-35-08-19-9D-22-E9-84-12-9B-E6
Technically using Base64 is bad if this is Windows, filenames are case insensitive (at least in explorers view).. but in base64, 'a' is different to 'A', this means that perhaps unlikely but you end up with even higher rate of collision..
A better alternative is hexadecimal like the bitconverter class, or if you can- use base32 encoding (which after removing the padding from both base64 and base32, and in the case of 128bit, will give you similar length filenames).
Try Base32 hash of MD5. It gives filename-safe case insensitive strings.
string Base32Hash(string input)
{
byte[] buf = MD5.Create().ComputeHash(Encoding.UTF8.GetBytes(input));
return String.Join("", buf.Select(b => "abcdefghijklmonpqrstuvwxyz234567"[b & 0x1F]));
}

Categories

Resources