Getting unknown characters while decoding string in c#

I am working on a project where I submit my form data as XML and store it as XML in my database.
While decoding the XML data in C# I am getting unknown characters. I am saving Spanish text, e.g. "Introduzca texto aquí".
In the XML, the 'í' character arrives as %ED, and when I decode it with HttpUtility.UrlDecode(formData) I get � instead of í.
XML Data before decoding
%3CArrayOfDiary%3E%3CDiary%3E%3CDate%3E03042015%3C/Date%3E%3CSituation%3EIntroduzca%20texto%20aqu%ED%3C/Situation%3E%3CSensation%3EIntroduzca%20texto%20aqu%ED%3C/Sensation%3E%3CConcern%3EIntroduzca%20texto%20aqu%ED%3C/Concern%3E%3CBeliefRating%3E0%3C/BeliefRating%3E%3CAnxietyRating%3E0%3C/AnxietyRating%3E%3C/Diary%3E%0A%20%20%3CArrayOfDiary%3E
Data after decoding
<ArrayOfDiary><Diary><Date>03042015</Date><Situation>Introduzca texto aqu�</Situation><Sensation>Introduzca texto aqu�</Sensation><Concern>Introduzca texto aqu�</Concern><BeliefRating>0</BeliefRating><AnxietyRating>0</AnxietyRating></Diary>
<Diary>
<Date>03042015</Date>
<Situation> Introduzca texto aqu�</Situation>
<Sensation> Introduzca texto aqu�</Sensation>
<Concern> Introduzca texto aqu�</Concern>
<BeliefRating>0</BeliefRating>
<AnxietyRating>0</AnxietyRating>
</Diary>
</ArrayOfDiary>
Please help me. Thanks

Without seeing where the data is coming from, I assume that it has been created with an encoding of ISO-8859-1.
You can get around the problem by using the appropriate Encoding in UrlDecode:
Option Infer On
' ....
Dim s = "%3CArrayOfDiary%3E%3CDiary%3E%3CDate%3E03042015%3C/Date%3E%3CSituation%3EIntroduzca%20texto%20aqu%ED%3C/Situation%3E%3CSensation%3EIntroduzca%20texto%20aqu%ED%3C/Sensation%3E%3CConcern%3EIntroduzca%20texto%20aqu%ED%3C/Concern%3E%3CBeliefRating%3E0%3C/BeliefRating%3E%3CAnxietyRating%3E0%3C/AnxietyRating%3E%3C/Diary%3E%0A%20%20%3CArrayOfDiary%3E"
Dim enc = Encoding.GetEncoding("ISO-8859-1")
Dim txt = Web.HttpUtility.UrlDecode(s, enc)
To avoid that hassle, you could use <meta charset="utf-8" /> in the <head> section of the web page. You can still have <html lang="es"> if you want to indicate that the page is in Spanish.
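Since the question is about C#, here is a rough C# equivalent of the VB.NET decoding call. It is only a sketch under the same assumption as the answer (that the data was percent-encoded as ISO-8859-1); the UrlDecodeDemo class and DecodeLatin1 method are hypothetical names, and the sample string is a shortened piece of the data from the question.

```csharp
using System;
using System.Text;
using System.Web;

public static class UrlDecodeDemo
{
    // Decodes a percent-encoded string, interpreting the escaped bytes
    // as ISO-8859-1 (assumed source encoding) instead of the UTF-8 default.
    public static string DecodeLatin1(string urlEncoded)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return HttpUtility.UrlDecode(urlEncoded, latin1);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        string sample = "%3CSituation%3EIntroduzca%20texto%20aqu%ED%3C/Situation%3E";
        // %ED is the single byte 0xED, which is 'í' in ISO-8859-1.
        Console.WriteLine(DecodeLatin1(sample));
    }
}
```

The one-argument UrlDecode overload assumes UTF-8, where a lone 0xED byte is not a valid sequence, which is why the replacement character shows up.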

https://msdn.microsoft.com/en-us/library/ms246590.aspx
Check out this page; it should help you. It sounds like you may not have the right LIP.

Related

C# - create and write HTML file with variables

Basically, I'm trying to create an HTML file. I already have it written, but I want the user to be able to enter some text in text boxes, save it into strings, and use those strings later when creating the HTML file.
I tried playing a bit with StreamWriter, but I don't think that is the best approach.
I also want the file to open in the default web browser (or just in IE if that's easier) after it is created.
I really need help, as I'm struggling, especially with the creating part.
Thanks for reading!

You can also do this without external libraries.
Set up your HTML file as follows:
<!DOCTYPE html>
<html>
<head>
<title>{MY_TITLE}</title>
</head>
<body></body>
</html>
Then edit and save the HTML from C#:
const string fileName = "Foobar.html";
//Read HTML from file
var content = File.ReadAllText(fileName);
//Replace all values in the HTML
content = content.Replace("{MY_TITLE}", titleTextBox.Text);
//Write new HTML string to file
File.WriteAllText(fileName, content);
//Show it in the default application for handling .html files
Process.Start(fileName);
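
One thing to watch with the snippet above: it writes the result back over the template file, so the {MY_TITLE} placeholder is gone after the first run, and the text box content goes into the page unescaped. A minimal sketch of a variation, assuming you are fine with a separate output file; HtmlTemplateDemo, FillTemplate, and the Foobar.out.html name are made up for illustration, and the template is inlined here only to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Net;

public static class HtmlTemplateDemo
{
    // Replaces the {MY_TITLE} placeholder with the HTML-encoded user text,
    // so characters like '<' and '&' cannot break the markup.
    public static string FillTemplate(string template, string title)
    {
        return template.Replace("{MY_TITLE}", WebUtility.HtmlEncode(title));
    }

    public static void Main()
    {
        // In a real app this would come from File.ReadAllText on the template file.
        const string template =
            "<!DOCTYPE html><html><head><title>{MY_TITLE}</title></head><body></body></html>";

        // Write to a separate file so the template keeps its placeholder.
        string outputPath = Path.Combine(Path.GetTempPath(), "Foobar.out.html");
        File.WriteAllText(outputPath, FillTemplate(template, "Tom & Jerry <draft>"));

        Console.WriteLine(outputPath);
    }
}
```

The generated file can then be opened with Process.Start as in the answer.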

If you already have the HTML you want to export (just not customized), you could manually add format strings to it (like {0}, {1}, {2}) where you want to substitute text from your app, then embed it as a resource, load it in at runtime, substitute the TextBox text using string.Format, and finally write it out again. This is admittedly a really fragile way to do it, as you need to make sure the number of parameters agrees between the resource file and your call to string.Format. In fact, this is a horrible way to do it. Actually, you should do it the way @EmilePels suggests, which is basically a less fragile version of this answer.

C# decoding "â„¢" to "TM"

On a web page there is the following string:
"Qualcomm Snapdragon™ S4"
When I get this string in my .NET code, it is converted to "Qualcomm Snapdragonâ„¢ S4".
The character "TM" changes to â„¢.
How can I decode "â„¢" back to "TM"?
Update
The following is the code that downloads the string using the webproxy
(wc is the webproxy):
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8");
string html = Server.HtmlEncode(wc.DownloadString(url));

You should read the webpage in its proper encoding in the first place. In this case it seems you are reading with Encoding.Default (i.e. probably CP1252) and the page is really in UTF-8. This should be apparent either by reading the Content-Type header of the response or by looking for a <meta http-equiv="Content-Type" content='text/html; charset=utf-8'> in the content.
If you still need to do this after the fact, then use
var bytes = Encoding.Default.GetBytes(myString);
var correctString = Encoding.UTF8.GetString(bytes);
In any case you would need to know the exact encodings that were used on the page and for reading the malformed string in the first place. Furthermore I'd generally advise explicitly against using Encoding.Default because its value isn't fixed. It's just the legacy encoding on a Windows system for use in non-Unicode applications and also gets used as the default non-Unicode text file encoding. It should have no place whatsoever in handling external resources.
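
If the garbled string is all you have, the two-line repair above can also be written with the wrong encoding named explicitly instead of relying on Encoding.Default. A minimal sketch, assuming the text was misread as Windows-1252 and that the code runs on .NET Core 3.0 or later, where code page 1252 has to be registered before use (on .NET Framework the registration line is not needed). MojibakeRepair and Repair are hypothetical names.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Re-encodes the garbled text with the encoding it was wrongly read in,
    // which recovers the original bytes, then decodes those bytes as UTF-8.
    public static string Repair(string garbled, Encoding wrongEncoding)
    {
        byte[] originalBytes = wrongEncoding.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    public static void Main()
    {
        // Makes legacy code pages such as 1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // "â„¢" written with escapes: U+00E2, U+201E, U+00A2.
        string garbled = "Qualcomm Snapdragon\u00E2\u201E\u00A2 S4";

        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Repair(garbled, cp1252));
    }
}
```

This only works when the wrong decoding step did not lose any bytes, so reading the page as UTF-8 in the first place remains the better fix.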

How to encode string for ID3 tags in C#

I am having problems with encoding in ID3 tags. I query a webservice which returns back some XML including a node such as the one below:
<name>Blue Öyster Cult</name>
I am then using this information to update my ID3 tags. The problem is that the tag is updated as:
Blue Ã–yster Cult
I know this is an encoding issue, but I'm struggling to work out how to get it to work. My understanding is that ID3 tags need to be encoded as ISO-8859-1.
I wrote this code, but it makes no difference:
Encoding newEncoding = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = Encoding.UTF8.GetBytes(newArtistName);
byte[] asciBytes = Encoding.Convert(utf8, newEncoding, utfBytes);
string encodedArtistName = newEncoding.GetString(asciBytes);
Is this in the right direction or not?
Any advice much appreciated.

ID3 v2
Textual frames are marked with an encoding byte.
$00 – ISO-8859-1 (Latin-1, a superset of ASCII).
$01 – UCS-2 (UTF-16 encoded Unicode with BOM), in ID3v2.2 and ID3v2.3.
$02 – UTF-16BE encoded Unicode without BOM, in ID3v2.4.
$03 – UTF-8 encoded Unicode, in ID3v2.4.
Detailed specification can be found at http://id3.org/id3v2-00.
Also see View/edit ID3 data for MP3 files - post with similar issue.
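
To make the encoding byte concrete, here is a small sketch that builds only the body of a text frame: the marker byte followed by the encoded text. It is not a tag writer. The frame header, any terminators, and the $01/$02 variants are left out, and Id3TextPayload and Build are hypothetical names rather than part of any ID3 library.

```csharp
using System;
using System.Text;

public static class Id3TextPayload
{
    // Returns [encoding byte][encoded text]:
    // $03 + UTF-8 (ID3v2.4) when useUtf8 is true, otherwise $00 + ISO-8859-1.
    public static byte[] Build(string text, bool useUtf8)
    {
        byte marker = useUtf8 ? (byte)0x03 : (byte)0x00;
        Encoding encoding = useUtf8 ? Encoding.UTF8 : Encoding.GetEncoding("ISO-8859-1");

        byte[] textBytes = encoding.GetBytes(text);
        byte[] payload = new byte[textBytes.Length + 1];
        payload[0] = marker;
        Buffer.BlockCopy(textBytes, 0, payload, 1, textBytes.Length);
        return payload;
    }

    public static void Main()
    {
        // Ö is one byte (D6) in ISO-8859-1 and two bytes (C3 96) in UTF-8.
        Console.WriteLine(BitConverter.ToString(Build("Blue Öyster Cult", false)));
        Console.WriteLine(BitConverter.ToString(Build("Blue Öyster Cult", true)));
    }
}
```

In practice a tagging library handles this for you; the point is that the marker byte and the bytes that follow have to agree.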

The issue that I was having was actually prior to saving to the ID3 tags. The XML I was receiving was set to UTF-8, but the WebClient requesting the page was not. Adding the second line shown here resolved the problem.
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
String htmlCode = client.DownloadString(requestURL);
When the value is extracted from this XML, it has the correct encoding to be saved to the file's ID3 tag.

ABCpdf 5 Problems with encoding (special characters)

I am using ABCpdf version 5 to render some HTML pages into PDFs.
I basically use the HttpServerUtility.Execute() method to retrieve the HTML for the PDF:
System.IO.StringWriter writer = new System.IO.StringWriter();
server.Execute(requestUrl, writer);
string pageResult = writer.ToString();
WebSupergoo.ABCpdf5.Doc pdfDoc = new WebSupergoo.ABCpdf5.Doc();
pdfDoc.AddImageHtml(pageResult);
response.Buffer = false;
response.ContentType = "application/pdf";
response.AddHeader("Content-Disposition", "attachment;filename=MyPdf_" +
FormatDate(DateTime.Now, "yyyy-MM-dd") + ".pdf");
response.BinaryWrite(pdfDoc.GetData());
Now some special characters like umlauts (äöü) are replaced with an empty space. Interestingly, not all of them. Here is what I figured out:
Within the HTML page I have:
<meta http-equiv="content-type" content="text/xhtml; charset=utf-8" />
If I strip this tag out, all special characters are rendered correctly. But this seems like an ugly hack to me.
Earlier I did not use HttpServerUtility.Execute(); instead I let ABCpdf call the URL itself: pdfDoc.AddImageUrl("someUrl");. There I had no such encoding problems.
What else could I try?

Just came across this problem with ABCpdf 8.
In your code you retrieve HTML contents and pass the pageResult to AddImageHtml(). As the documentation states,
ABCpdf saves this HTML into a temporary file and renders the file
using a 'file://' protocol specifier.
What is not mentioned is that the temp file is UTF-8 encoded, but the encoding is not stated in the HTML file.
The <meta> tag actually sets the required encoding, and solved my problem.
One way to avoid declaring the encoding is to use the AddImageUrl() method, which I expect detects the HTML encoding from the HTTP/HTML response.

The encoding meta tag and the AddImageUrl method may help with a simple document, but not in a chained situation, where the encoding somehow gets lost despite the encoding tag. I encountered this problem (exactly as described in the original question: some foreign characters such as umlauts would disappear) and see no solution. I am considering getting rid of ABCpdf altogether and replacing it with SSRS, which can render PDF output.

PrinceXML: "Input is not proper UTF-8"

I'm generating HTML from a database and then sending it to PrinceXML for conversion to PDF. The code I use to do this is:
string _htmlTemplate = @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html lang=""en-GB"" xml:lang=""en-GB"" xmlns=""http://www.w3.org/1999/xhtml""><head><meta http-equiv=""Content-type"" content=""text/html;charset=UTF-8"" /><title>Generated PDF Contract</title></head><body>{0}</body></html>";
string _pgeContent = string.Format(_htmlTemplate, sb.ToString());
writer.Write(sb.ToString());
Byte[] arrBytes = UTF8Encoding.Default.GetBytes(_pgeContent);
Stream s = new MemoryStream(arrBytes);
Prince princeConverter = new Prince(ConfigurationManager.AppSettings["PrinceXMLInstallLoc"].ToString());
princeConverter.SetLog(ConfigurationManager.AppSettings["PrinceXMLLogLoc"]);
princeConverter.AddStyleSheet(Server.MapPath(ConfigurationManager.AppSettings["FormsDocGenCssLocl"]));
Response.ClearContent();
Response.ClearHeaders();
Response.ContentType = "application/pdf";
Response.BufferOutput = true;
However, conversion fails with the error:
Input is not proper UTF-8, indicate encoding ! Bytes: 0xA0 0x77 0x65 0x62
I've taken the generated html and uploaded it to the W3C validator. It validates the markup as UTF-8 encoded XHTML 1.0 Transitional with no errors or warnings.
I've also gone through the file with a fine-tooth comb looking for invalid characters. So far, nothing.
Can anyone suggest something else I could try?

Well after an afternoon of muttering curses and tearing out what is left of my hair, I figured out a fix for my particular problem.
It would appear that System.Text.UTF8Encoding doesn't output a UTF-8 identifier byte by default. So in my case I needed to use the constructor that takes a boolean parameter to control output of this.
UTF8Encoding u8enc = new UTF8Encoding(true);//Ensures a UTF8 identifier is emitted.
After this it was all good. Hope this helps someone :-)
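
One possible reading of why this helped, which the answer itself does not confirm: UTF8Encoding.Default in the question is the static Encoding.Default property inherited from the base class, and on .NET Framework that is the system ANSI code page, not UTF-8. Also, GetBytes never writes a byte order mark; the constructor flag only changes what GetPreamble() returns. The sketch below illustrates both points, using ISO-8859-1 as a stand-in for an ANSI code page (an assumption) and a non-breaking space followed by "web", which matches the bytes in the error message. Utf8BytesDemo and Hex are hypothetical names.

```csharp
using System;
using System.Text;

public static class Utf8BytesDemo
{
    // Formats the bytes produced by the given encoding as hex, e.g. "C2-A0".
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string text = "\u00A0web"; // non-breaking space + "web"

        // A single-byte encoding yields the lone 0xA0 that Prince rejected.
        Console.WriteLine(Hex(text, Encoding.GetEncoding("ISO-8859-1"))); // A0-77-65-62

        // A real UTF-8 encoder yields C2 A0, with or without the BOM flag.
        Console.WriteLine(Hex(text, new UTF8Encoding(true)));  // C2-A0-77-65-62
        Console.WriteLine(Hex(text, new UTF8Encoding(false))); // C2-A0-77-65-62

        // The flag only affects the preamble, which GetBytes does not emit.
        Console.WriteLine(new UTF8Encoding(true).GetPreamble().Length);  // 3
        Console.WriteLine(new UTF8Encoding(false).GetPreamble().Length); // 0
    }
}
```

If that reading is right, Encoding.UTF8.GetBytes(_pgeContent) would have been an equivalent fix.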
