Utf7Encoding Text truncation

I ran into an issue where the UTF7Encoding class drops the '+4' sequence when decoding, and I would like to know why this happens.
When I use UTF8Encoding to get the string from the byte[] array instead, it seems to work fine.
Are there any known issues like this with UTF-8? I use the output of this conversion to construct HTML from an RTF string.
Here is the snippet:
UTF7Encoding utf = new UTF7Encoding();
UTF8Encoding utf8 = new UTF8Encoding();
string test = "blah blah 9+4";
char[] chars = test.ToCharArray();
byte[] charBytes = new byte[chars.Length];
for (int i = 0; i < chars.Length; i++)
{
charBytes[i] = (byte)chars[i];
}
string resultString = utf8.GetString(charBytes);
string resultStringWrong = utf.GetString(charBytes);
Console.WriteLine(resultString); //blah blah 9+4
Console.WriteLine(resultStringWrong); //blah blah 9

Converting to a byte array by casting each char to a byte like that does not produce UTF-7 bytes. If you want the string as charset-specific byte[] data, do this:
UTF7Encoding utf = new UTF7Encoding();
UTF8Encoding utf8 = new UTF8Encoding();
string test = "blah blah 9+4";
byte[] utfBytes = utf.GetBytes(test);
byte[] utf8Bytes = utf8.GetBytes(test);
string utfString = utf.GetString(utfBytes);
string utf8String = utf8.GetString(utf8Bytes);
Console.WriteLine(utfString);
Console.WriteLine(utf8String);
Output:
blah blah 9+4
blah blah 9+4

You are not translating the string to UTF-7 bytes correctly. You should call utf.GetBytes() instead of casting each character to a byte.
In UTF-7, '+' is reserved: it starts a modified Base64 run used to encode non-ASCII Unicode characters, so a literal '+' has to be written as "+-".
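A minimal sketch of the point above, assuming the class name Utf7PlusDemo and its helper methods are hypothetical names introduced only for illustration. It shows what UTF7Encoding.GetBytes() produces for a literal '+' and that encoding and decoding with the same encoding preserves the text. Note that UTF7Encoding is marked obsolete on .NET 5 and later, so the pragma is only there to silence that warning.

```csharp
using System;
using System.Text;

public static class Utf7PlusDemo
{
    // UTF7Encoding is obsolete (SYSLIB0001) on .NET 5 and later;
    // the pragma only silences that warning for this demonstration.
#pragma warning disable SYSLIB0001
    private static readonly Encoding Utf7 = new UTF7Encoding();
#pragma warning restore SYSLIB0001

    // Encodes text as UTF-7 and renders the resulting bytes as ASCII text,
    // which makes the escaping of '+' visible.
    public static string EncodeToAsciiView(string text)
    {
        byte[] utf7Bytes = Utf7.GetBytes(text);
        return Encoding.ASCII.GetString(utf7Bytes);
    }

    // Encodes and decodes with the same encoding, so the text survives.
    public static string RoundTrip(string text)
    {
        return Utf7.GetString(Utf7.GetBytes(text));
    }
}
```

Calling Utf7PlusDemo.EncodeToAsciiView("9+4") yields "9+-4": the encoder escapes the plus sign, which a plain char-to-byte cast never does.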

Related

How to remove unknown chars on string in windows-1251 charset

I have a text which cannot be converted to windows-1251 charset. For example:
中华全国工商业联合会-HelloWorld
I have a method for converting from UTF8 to windows-1251:
static string ChangeEncoding(string text)
{
if (text == null || text == "")
return "";
Encoding win1251 = Encoding.GetEncoding("windows-1251");
Encoding ascii = Encoding.UTF8;
byte[] utfBytes = ascii.GetBytes(text);
byte[] isoBytes = Encoding.Convert(ascii, win1251, utfBytes);
return win1251.GetString(isoBytes);
}
Now it is returning this:
??????????-HelloWorld
I don't want to show the characters that could not be converted to windows-1251. In this case I want just:
-HelloWorld
How can I do this?

Following @JeroenMostert's suggestion, this method helped me:
public static string ChangeEncoding(string text)
{
Encoding win1251 = Encoding.GetEncoding("windows-1251", new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());
return win1251.GetString(Encoding.Convert(Encoding.UTF8, win1251, Encoding.UTF8.GetBytes(text)));
}
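The same idea can be sketched without Encoding.Convert: encode with an empty replacement fallback so unmappable characters are dropped, then decode with the same encoding. The names UnmappableCharStripper and StripUnmappable are hypothetical. One assumption to be aware of: on .NET Core and .NET 5 or later, "windows-1251" is typically only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) from the System.Text.Encoding.CodePages package, so the usage note uses ISO-8859-1 instead.

```csharp
using System.Text;

public static class UnmappableCharStripper
{
    // Hypothetical helper: returns only the characters of 'text' that the
    // named target encoding can represent.
    public static string StripUnmappable(string text, string encodingName)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        // An empty replacement string makes the encoder emit nothing
        // for characters it cannot map, instead of the default '?'.
        Encoding target = Encoding.GetEncoding(
            encodingName,
            new EncoderReplacementFallback(string.Empty),
            new DecoderReplacementFallback(string.Empty));

        byte[] bytes = target.GetBytes(text);
        return target.GetString(bytes);
    }
}
```

For example, StripUnmappable("中华全国工商业联合会-HelloWorld", "iso-8859-1") returns "-HelloWorld", because none of the Chinese characters exist in ISO-8859-1.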

How to convert a string to UTF8?

I have a string that contains some unicode, how do I convert it to UTF-8 encoding?

This snippet makes an array of bytes with your string encoded in UTF-8:
UTF8Encoding utf8 = new UTF8Encoding();
string unicodeString = "Quick brown fox";
byte[] encodedBytes = utf8.GetBytes(unicodeString);
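To see what GetBytes() actually produces, it can help to print the bytes as hex. This is only an illustrative sketch; Utf8BytesDemo and ToUtf8Hex are hypothetical names, not part of the answer above.

```csharp
using System;
using System.Text;

public static class Utf8BytesDemo
{
    // Returns the UTF-8 encoding of 'text' as dash-separated hex pairs.
    public static string ToUtf8Hex(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return BitConverter.ToString(utf8Bytes);
    }
}
```

ToUtf8Hex("A") gives "41" (one byte), while ToUtf8Hex("\u00E9") (the letter é) gives "C3-A9" (two bytes), which is why a char-to-byte cast is not a substitute for GetBytes().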

Try this function; it should work out of the box. You may need to adjust the naming conventions though.
private string UnicodeToUTF8(string strFrom)
{
byte[] bytSrc;
byte[] bytDestination;
string strTo = String.Empty;
// Get the UTF-16 bytes of the source string.
bytSrc = Encoding.Unicode.GetBytes(strFrom);
// Convert to UTF-8 (converting to ASCII here would turn every non-ASCII character into '?').
bytDestination = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, bytSrc);
strTo = Encoding.UTF8.GetString(bytDestination);
return strTo;
}

This should do it with minimal code:
byte[] bytes = Encoding.Default.GetBytes(myString);
myString = Encoding.UTF8.GetString(bytes);

Try this code:
string unicodeString = "Quick brown fox";
// Reserve capacity up front; each char is narrowed to a byte, so this only suits ASCII text.
var bytes = new List<byte>(unicodeString.Length);
foreach (var c in unicodeString)
bytes.Add((byte)c);
var retValue = Encoding.UTF8.GetString(bytes.ToArray());

Unicode-to-string conversion in C#

How can I convert a Unicode value to its equivalent string?
For example, I have "రమెశ్", and I need a function that accepts this Unicode value and returns a string.
I was looking at the System.Text.Encoding.Convert() function, but that does not take in a Unicode value; it takes two encodings and a byte array.
I basically have a byte array that I need to save in a string field, and later I need to convert that string back to a byte array.
So I use ByteConverter.GetString(byteArray) to save the byte array to a string, but I can't get it back to a byte array.

Use .ToString():
this.Text = ((char)0x00D7).ToString();

Try the following:
byte[] bytes = ...;
string convertedUtf8 = Encoding.UTF8.GetString(bytes);
string convertedUtf16 = Encoding.Unicode.GetString(bytes); // For UTF-16
The other way around uses GetBytes():
byte[] bytesUtf8 = Encoding.UTF8.GetBytes(convertedUtf8);
byte[] bytesUtf16 = Encoding.Unicode.GetBytes(convertedUtf16);
In the Encoding class, there are more variants if you need them.
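For the case described in the question, where arbitrary bytes have to be stored in a string field and recovered later, decoding with a text encoding can lose data when the bytes are not valid text in that encoding. One common alternative is Base64, sketched below. This is a swapped-in technique rather than what the answers above use, and BinaryAsText with its two methods are hypothetical names.

```csharp
using System;

public static class BinaryAsText
{
    // Turns arbitrary bytes into a plain-ASCII string that is safe to store.
    public static string ToText(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // Recovers the exact original bytes from the stored string.
    public static byte[] FromText(string text)
    {
        return Convert.FromBase64String(text);
    }
}
```

The round trip BinaryAsText.FromText(BinaryAsText.ToText(bytes)) returns the same byte values for any input, at the cost of a string roughly one third longer than the data.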

To convert a string to a Unicode string, do it like this: very simple... note the BytesToString function which avoids using any inbuilt conversion stuff. Fast, too.
private string BytesToString(byte[] Bytes)
{
MemoryStream MS = new MemoryStream(Bytes);
StreamReader SR = new StreamReader(MS);
string S = SR.ReadToEnd();
SR.Close();
return S;
}
private string ToUnicode(string S)
{
return BytesToString(new UnicodeEncoding().GetBytes(S));
}

UTF8Encoding Class
UTF8Encoding uni = new UTF8Encoding();
Console.WriteLine( uni.GetString(new byte[] { 1, 2 }));

There are different types of encoding. You can try some of them to see whether your byte stream gets converted correctly:
System.Text.ASCIIEncoding encodingASCII = new System.Text.ASCIIEncoding();
System.Text.UTF8Encoding encodingUTF8 = new System.Text.UTF8Encoding();
System.Text.UnicodeEncoding encodingUNICODE = new System.Text.UnicodeEncoding();
var ascii = string.Format("{0}: {1}", encodingASCII.ToString(), encodingASCII.GetString(textBytesASCII));
var utf = string.Format("{0}: {1}", encodingUTF8.ToString(), encodingUTF8.GetString(textBytesUTF8));
var unicode = string.Format("{0}: {1}", encodingUNICODE.ToString(), encodingUNICODE.GetString(textBytesCyrillic));
Have a look here as well: http://george2giga.com/2010/10/08/c-text-encoding-and-transcoding/.

var ascii = $"{new ASCIIEncoding().ToString()}: {((ASCIIEncoding)new ASCIIEncoding()).GetString(textBytesASCII)}";
var utf = $"{new UTF8Encoding().ToString()}: {((UTF8Encoding)new UTF8Encoding()).GetString(textBytesUTF8)}";
var unicode = $"{new UnicodeEncoding().ToString()}: {((UnicodeEncoding)new UnicodeEncoding()).GetString(textBytesCyrillic)}";

I wrote a loop that converts \uXXXX escape sequences in a string to the actual characters:
string stringWithUnicodeSymbols = @"{""id"": 10440119, ""photo"": 10945418, ""first_name"": ""\u0415\u0432\u0433\u0435\u043d\u0438\u0439""}";
var splitted = Regex.Split(stringWithUnicodeSymbols, @"\\u([a-fA-F\d]{4})");
string outString = "";
foreach (var s in splitted)
{
try
{
if (s.Length == 4)
{
var decoded = ((char) Convert.ToUInt16(s, 16)).ToString();
outString += decoded;
}
else
{
outString += s;
}
}
catch (Exception e)
{
outString += s;
}
}
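The split-and-loop approach above treats any four-character fragment as hex, which can misfire on ordinary text that happens to be four characters long. A hedged alternative sketch uses Regex.Replace with a match evaluator so only real \uXXXX matches are converted. UnicodeEscapes and Decode are hypothetical names; for complete JSON documents a JSON parser would normally handle these escapes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeEscapes
{
    // Matches a literal backslash, 'u', and exactly four hex digits.
    private static readonly Regex EscapePattern =
        new Regex(@"\\u([0-9a-fA-F]{4})");

    // Replaces each \uXXXX escape with the UTF-16 code unit it names;
    // everything else in the input is left untouched.
    public static string Decode(string input)
    {
        return EscapePattern.Replace(input, m =>
            ((char)Convert.ToUInt16(m.Groups[1].Value, 16)).ToString());
    }
}
```

For example, UnicodeEscapes.Decode(@"\u0415\u0432") returns "Ев", and a string without escapes comes back unchanged.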

C# Email subject parsing

I'm building a system for reading emails in C#. I've got a problem parsing the subject, a problem which I think is related to encoding.
The subject I'm reading is as follows: =?ISO-8859-1?Q?=E6=F8sd=E5f=F8sdf_sdfsdf?=, the original subject sent is æøsdåføsdf sdfsdf (Norwegian characters in there).
Any ideas how I can change encoding or parse this correctly? So far I've tried to use the C# encoding conversion techniques to encode the subject to utf8, but without any luck.
Here is one of the solutions I tried:
Encoding iso = Encoding.GetEncoding("iso-8859-1");
Encoding utf = Encoding.UTF8;
string decodedSubject =
utf.GetString(Encoding.Convert(utf, iso,
iso.GetBytes(m.Subject.Split('?')[3])));

The encoding is called quoted printable.
See the answers to this question.
Adapted from the accepted answer:
public string DecodeQuotedPrintable(string value)
{
Attachment attachment = Attachment.CreateAttachmentFromString("", value);
return attachment.Name;
}
When passed the string =?ISO-8859-1?Q?=E6=F8sd=E5f=F8sdf_sdfsdf?= this returns "æøsdåføsdf_sdfsdf".

public static string DecodeEncodedWordValue(string mimeString)
{
var regex = new Regex(@"=\?(?<charset>.*?)\?(?<encoding>[qQbB])\?(?<value>.*?)\?=");
var encodedString = mimeString;
var decodedString = string.Empty;
while (encodedString.Length > 0)
{
var match = regex.Match(encodedString);
if (match.Success)
{
// If the match isn't at the start of the string, copy the initial few chars to the output
decodedString += encodedString.Substring(0, match.Index);
var charset = match.Groups["charset"].Value;
var encoding = match.Groups["encoding"].Value.ToUpper();
var value = match.Groups["value"].Value;
if (encoding.Equals("B"))
{
// Encoded value is Base-64
var bytes = Convert.FromBase64String(value);
decodedString += Encoding.GetEncoding(charset).GetString(bytes);
}
else if (encoding.Equals("Q"))
{
// Encoded value is Quoted-Printable
// Parse looking for =XX where XX is hexadecimal
var regx = new Regex("(\\=([0-9A-F][0-9A-F]))", RegexOptions.IgnoreCase);
decodedString += regx.Replace(value, new MatchEvaluator(delegate(Match m)
{
var hex = m.Groups[2].Value;
var iHex = Convert.ToInt32(hex, 16);
// Return the string in the charset defined
var bytes = new byte[1];
bytes[0] = Convert.ToByte(iHex);
return Encoding.GetEncoding(charset).GetString(bytes);
}));
decodedString = decodedString.Replace('_', ' ');
}
else
{
// Encoded value not known, return original string
// (Match should not be successful in this case, so this code may never get hit)
decodedString += encodedString;
break;
}
// Trim off up to and including the match, then we'll loop and try matching again.
encodedString = encodedString.Substring(match.Index + match.Length);
}
else
{
// No match, not encoded, return original string
decodedString += encodedString;
break;
}
}
return decodedString;
}
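One limitation of decoding each =XX escape on its own, as the Q branch above does, is that multi-byte charsets such as UTF-8 need several bytes per character. A minimal sketch of an alternative is to collect all bytes of the payload first and decode them once. It assumes the payload is the ASCII text between the third and fourth '?' of an encoded word; QEncoding and DecodePayload are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class QEncoding
{
    // Decodes the payload of a "Q"-encoded word into text using 'charset'.
    public static string DecodePayload(string payload, string charset)
    {
        var bytes = new List<byte>(payload.Length);
        for (int i = 0; i < payload.Length; i++)
        {
            char c = payload[i];
            byte value;
            if (c == '=' && i + 2 < payload.Length + 1 - 0 && i + 2 <= payload.Length - 1
                && byte.TryParse(payload.Substring(i + 1, 2),
                    NumberStyles.AllowHexSpecifier,
                    CultureInfo.InvariantCulture, out value))
            {
                bytes.Add(value); // =XX is one raw byte
                i += 2;           // skip the two hex digits
            }
            else if (c == '_')
            {
                bytes.Add(0x20);  // underscore stands for a space
            }
            else
            {
                bytes.Add((byte)c); // assumed to be plain ASCII
            }
        }
        // Decode all bytes together so multi-byte sequences stay intact.
        return Encoding.GetEncoding(charset).GetString(bytes.ToArray());
    }
}
```

With the subject from the question, DecodePayload("=E6=F8sd=E5f=F8sdf_sdfsdf", "iso-8859-1") returns "æøsdåføsdf sdfsdf", and DecodePayload("=C3=A6", "utf-8") returns "æ".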

C# Convert string from UTF-8 to ISO-8859-1 (Latin1) H

I have googled on this topic and I have looked at every answer, but I still don't get it.
Basically I need to convert UTF-8 string to ISO-8859-1 and I do it using following code:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
string msg = iso.GetString(utf8.GetBytes(Message));
My source string is
Message = "ÄäÖöÕõÜü"
But unfortunately my result string becomes
msg = "Ã?Ã¤Ã?Ã¶Ã?ÃµÃ?Ã¼"
What am I doing wrong here?

Use Encoding.Convert to adjust the byte array before attempting to decode it into your destination encoding.
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(Message);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string msg = iso.GetString(isoBytes);

I think your problem is that you assume the bytes that represent the UTF-8 string will yield the same string when interpreted as something else (ISO-8859-1). That is simply not the case. I recommend that you read this excellent article by Joel Spolsky.

Try this:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(Message);
byte[] isoBytes = Encoding.Convert(utf8,iso,utfBytes);
string msg = iso.GetString(isoBytes);

You need to fix the source of the string in the first place.
A string in .NET is just a sequence of 16-bit UTF-16 code units, so a string as such isn't in any particular byte encoding.
Encoding only comes into play when you take that string and convert it to a set of bytes.
In any case, what you did, encoding a string to a byte array with one character set and then decoding it with another, will not work, as you have seen.
Can you tell us more about where that original string comes from, and why you think it has been encoded wrong?
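The mismatch described above can be reproduced in a few lines. This is an illustrative sketch only; EncodingMismatchDemo and its methods are hypothetical names.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Encodes with UTF-8 but decodes with ISO-8859-1: every UTF-8 byte
    // is misread as one Latin-1 character.
    public static string DecodeUtf8BytesAsLatin1(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return latin1.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Shows the bytes the same text occupies in ISO-8859-1.
    public static string Latin1Hex(string text)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return BitConverter.ToString(latin1.GetBytes(text));
    }
}
```

For "ä" (U+00E4), Latin1Hex returns "E4", a single byte, while DecodeUtf8BytesAsLatin1 returns "Ã¤", because the two UTF-8 bytes C3 A4 are read as two separate Latin-1 characters.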

That code seems a bit strange. To get a string from a UTF-8 byte stream, all you need to do is:
string str = Encoding.UTF8.GetString(utf8ByteArray);
If you then need to save it somewhere as an ISO-8859-1 byte stream, just add one more line:
byte[] iso88591data = Encoding.GetEncoding("ISO-8859-1").GetBytes(str);

Maybe it can help
Convert one codepage to another:
public static string fnStringConverterCodepage(string sText, string sCodepageIn = "ISO-8859-8", string sCodepageOut="ISO-8859-8")
{
string sResultado = string.Empty;
try
{
byte[] tempBytes;
tempBytes = System.Text.Encoding.GetEncoding(sCodepageIn).GetBytes(sText);
sResultado = System.Text.Encoding.GetEncoding(sCodepageOut).GetString(tempBytes);
}
catch (Exception)
{
sResultado = "";
}
return sResultado;
}
Usage:
string sMsg = "ERRO: NÃ£o foi possivel acessar o servico de AutenticaÃ§Ã£o";
var sOut = fnStringConverterCodepage(sMsg, "ISO-8859-1", "UTF-8");
Output:
"ERRO: Não foi possivel acessar o servico de Autenticação"

Encoding targetEncoding = Encoding.GetEncoding(1252);
// Encode a string into an array of bytes.
Byte[] encodedBytes = targetEncoding.GetBytes(utfString);
// Show the encoded byte values.
Console.WriteLine("Encoded bytes: " + BitConverter.ToString(encodedBytes));
// Decode the byte array back to a string.
String decodedString = targetEncoding.GetString(encodedBytes);

I just used Nathan's solution and it works fine. I needed to convert ISO-8859-1 to Unicode:
string isocontent = Encoding.GetEncoding("ISO-8859-1").GetString(fileContent, 0, fileContent.Length);
byte[] isobytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(isocontent);
byte[] ubytes = Encoding.Convert(Encoding.GetEncoding("ISO-8859-1"), Encoding.Unicode, isobytes);
return Encoding.Unicode.GetString(ubytes, 0, ubytes.Length);

Here is a sample for ISO-8859-9:
protected void btnKaydet_Click(object sender, EventArgs e)
{
Response.Clear();
Response.Buffer = true;
Response.ContentType = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
Response.AddHeader("Content-Disposition", "attachment; filename=XXXX.doc");
Response.ContentEncoding = Encoding.GetEncoding("ISO-8859-9");
Response.Charset = "ISO-8859-9";
EnableViewState = false;
StringWriter writer = new StringWriter();
HtmlTextWriter html = new HtmlTextWriter(writer);
form1.RenderControl(html);
byte[] bytesInStream = Encoding.GetEncoding("iso-8859-9").GetBytes(writer.ToString());
MemoryStream memoryStream = new MemoryStream(bytesInStream);
string msgBody = "";
string Email = "mail@xxxxxx.org";
SmtpClient client = new SmtpClient("mail.xxxxx.org");
MailMessage message = new MailMessage(Email, "mail@someone.com", "ONLINE APP FORM WITH WORD DOC", msgBody);
Attachment att = new Attachment(memoryStream, "XXXX.doc", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
message.Attachments.Add(att);
message.BodyEncoding = System.Text.Encoding.UTF8;
message.IsBodyHtml = true;
client.Send(message);
}
