PrinceXML: "Input is not proper UTF-8" - c#

I'm generating HTML from a database and then sending it to PrinceXML for conversion to PDF. The code I use to do this is:
string _htmlTemplate = #"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html lang=""en-GB"" xml:lang=""en-GB"" xmlns=""http://www.w3.org/1999/xhtml""><head><meta http-equiv=""Content-type"" content=""text/html;charset=UTF-8"" /><title>Generated PDF Contract</title></head><body>{0}</body></html>";
string _pgeContent = string.Format(_htmlTemplate, sb.ToString());
writer.Write(sb.ToString());
Byte[] arrBytes = UTF8Encoding.Default.GetBytes(_pgeContent);
Stream s = new MemoryStream(arrBytes);
Prince princeConverter = new Prince(ConfigurationManager.AppSettings["PrinceXMLInstallLoc"].ToString());
princeConverter.SetLog(ConfigurationManager.AppSettings["PrinceXMLLogLoc"]);
princeConverter.AddStyleSheet(Server.MapPath(ConfigurationManager.AppSettings["FormsDocGenCssLocl"]));
Response.ClearContent();
Response.ClearHeaders();
Response.ContentType = "application/pdf";
Response.BufferOutput = true;
However, conversion fails with the error:
Input is not proper UTF-8, indicate encoding ! Bytes: 0xA0 0x77 0x65 0x62
I've taken the generated html and uploaded it to the W3C validator. It validates the markup as UTF-8 encoded XHTML 1.0 Transitional with no errors or warnings.
I've also gone through the file with a fine tooth-comb looking for invalid characters. So far nothing.
Can anyone suggest something else I could try?

Well after an afternoon of muttering curses and tearing out what is left of my hair, I figured out a fix for my particular problem.
It would appear that System.Text.UTF8Encoding doesn't output a UTF-8 identifier byte by default. So in my case I needed to use the constructor that takes a boolean parameter to control output of this.
UTF8Encoding u8enc = new UTF8Encoding(true);//Ensures a UTF8 identifier is emitted.
After this it was all good. Hope this helps someone :-)

Related

Getting unknown characters while decoding string in c#

I am working on a project where I am submitting my form data as xml and storing it in xml form to my db.
While decoding the xml data in c# I am getting unknown characters. Actually I am saving spanish text for eg " Introduzca texto aquí ".
So in xml I get this 'í' character as %ED and while decoding it using HttpUtility.UrlDecode(formData) I am getting � instead of í.
XML Data before decoding
%3CArrayOfDiary%3E%3CDiary%3E%3CDate%3E03042015%3C/Date%3E%3CSituation%3EIntroduzca%20texto%20aqu%ED%3C/Situation%3E%3CSensation%3EIntroduzca%20texto%20aqu%ED%3C/Sensation%3E%3CConcern%3EIntroduzca%20texto%20aqu%ED%3C/Concern%3E%3CBeliefRating%3E0%3C/BeliefRating%3E%3CAnxietyRating%3E0%3C/AnxietyRating%3E%3C/Diary%3E%0A%20%20%3CArrayOfDiary%3E
Data after decoding
<ArrayOfDiary><Diary><Date>03042015</Date><Situation>Introduzca texto aqu�</Situation><Sensation>Introduzca texto aqu�</Sensation><Concern>Introduzca texto aqu�</Concern><BeliefRating>0</BeliefRating><AnxietyRating>0</AnxietyRating></Diary>
<Diary>
<Date>03042015</Date>
<Situation> Introduzca texto aqu�</Situation>
<Sensation> Introduzca texto aqu�</Sensation>
<Concern> Introduzca texto aqu�</Concern>
<BeliefRating>0</BeliefRating>
<AnxietyRating>0</AnxietyRating>
</Diary>
</ArrayOfDiary>
Please help me. Thanks
Without seeing where the data is coming from, I assume that it has been created with an encoding of ISO-8859-1.
You can get around the problem by using the appropriate Encoding in UrlDecode:
Option Infer On
' ....
Dim s = "%3CArrayOfDiary%3E%3CDiary%3E%3CDate%3E03042015%3C/Date%3E%3CSituation%3EIntroduzca%20texto%20aqu%ED%3C/Situation%3E%3CSensation%3EIntroduzca%20texto%20aqu%ED%3C/Sensation%3E%3CConcern%3EIntroduzca%20texto%20aqu%ED%3C/Concern%3E%3CBeliefRating%3E0%3C/BeliefRating%3E%3CAnxietyRating%3E0%3C/AnxietyRating%3E%3C/Diary%3E%0A%20%20%3CArrayOfDiary%3E"
Dim enc = Encoding.GetEncoding("ISO-8859-1")
Dim txt = Web.HttpUtility.UrlDecode(s, enc)
To avoid that hassle, you could use <meta charset="utf-8" /> in the <head> section of the web page. You can still have <html lang="es"> if you want to indicate that the page is in Spanish.
https://msdn.microsoft.com/en-us/library/ms246590.aspx
Check out this page it should help you, it sounds like you may not have the right LIP

HtmlAgilityPack - Convert MHTML To HTML as String

I have a MHTML file and I am trying to convert it to HTML.
I have installed the HtmlAgilityPack and tried the following code:
var doc = new HtmlAgilityPack.MixedCodeDocument();
doc.Load("C:\\Users\\DickTracey\\Downloads\\Club Membership Report.mhtml");
var ms = new MemoryStream();
var sw = new StreamWriter(ms);
doc.Save(sw);
ms.Position = 0;
var sr = new StreamReader(ms);
return sr.ReadToEnd();
But it always returns null.
Can anyone explain the correct procedure to convert MHTML to HTML please?
MHTML to HTML Decoding in C#!
string mhtml = "This is your MHTML string"; // Make sure the string is in UTF-8 encoding
MHTMLParser parser = new MHTMLParser(mhtml);
string html = parser.getHTMLText(); // This is the converted HTML
git link : https://github.com/DavidBenko/MHTML-to-HTML-Decoding-in-C-Sharp.git
I had a quick look at an MHTML file with HxD. Although, as noted above, HtmlAgilityPack has little or no support for MHTML, the format itself looks simple enough. It appears to consist of the usual suspects (unencoded HTML, CSS, JS, graphics encoded in Base64, etc) concatenated in a way (with mime type headers) that could be worked out with a little effort. Having said that, the format is probably fully documented somewhere -- so dust off your browser, write some C# to parse it, and spoon-feed HtmlAgilityPack with the results.

How to encode string for ID3 tags in C#

I am having problems with encoding in ID3 tags. I query a webservice which returns back some XML including a node such as the one below:
<name>Blue Öyster Cult</name>
I am then using this information to update my ID3 tags. The problem is that the tag is updated as:
Blue Öyster Cult
I know this is an encoding issue, but I'm struggling to work out how to get it to work. My understanding is that ID3 tags need to be encoded as ISO-8859-1.
I wrote this code, but it makes no difference:
Encoding newEncoding = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = Encoding.UTF8.GetBytes(newArtistName);
byte[] asciBytes = Encoding.Convert(utf8, newEncoding, utfBytes);
string encodedArtistName = newEncoding.GetString(asciBytes);
Is this in the right direction or not?
Any advice much appreciated.
ID3 v2
Textual frames are marked with an encoding byte.
$00 – ISO-8859-1 (ASCII).
$01 – UCS-2 (UTF-16 encoded Unicode with BOM), in ID3v2.2 and ID3v2.3.
$02 – UTF-16BE encoded Unicode without BOM, in ID3v2.4.
$03 – UTF-8 encoded Unicode, in ID3v2.4.
Detailed specification can be found at http://id3.org/id3v2-00.
Also see View/edit ID3 data for MP3 files - post with similar issue.
The issue that I was having was actually prior to saving to the ID3 tags. The XML I was receiving was set to UTF-8, but the WebClient requesting the page was not. Adding the second line shown here resolved the problem.
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
String htmlCode = client.DownloadString(requestURL);
When the value is extracting from this XML, it has the correct encoding to be saved to the files ID3 tag.

ABCpdf 5 Problems with encoding (special characters)

I am using ABCpdf Version 5 in order to render some html-pages into PDFs.
I basically use HttpServerUtility.Execute() - Method in order to retrieve the html for the pdf:
System.IO.StringWriter writer = new System.IO.StringWriter();
server.Execute(requestUrl, writer);
string pageResult = writer.ToString();
WebSupergoo.ABCpdf5.Doc pdfDoc = new WebSupergoo.ABCpdf5.Doc();
pdfDoc.AddImageHtml(pageResult);
response.Buffer = false;
response.ContentType = "application/pdf";
response.AddHeader("Content-Disposition", "attachment;filename=MyPdf_" +
FormatDate(DateTime.Now, "yyyy-MM-dd") + ".pdf");
response.BinaryWrite(pdfDoc.GetData());
Now some special characters like Umlaute (äöü) are replaced with an empty space. Interestingly not all of them. What I did figure out:
Within the html-page I have.
`<meta http-equiv="content-type" content="text/xhtml; charset=utf-8" />`
If I parse this away, all special chars are rendered correctly. But this seems to me like an ugly hack.
In earlier days I did not use HttpServerUtility.Execute(), but I let ABCpdf call the URL itself: pdfDoc.AddImageUrl("someUrl");. There I had no such encoding-problems.
What could I try else?
Just came across this problem with ABCpdf 8.
In your code you retrieve HTML contents and pass the pageResult to AddImageHtml(). As the documentation states,
ABCpdf saves this HTML into a temporary file and renders the file
using a 'file://' protocol specifier.
What is not mentioned is that the temp file is UTF-8 encoded, but the encoding is not stated in the HTML file.
The <meta> tag actually sets the required encoding, and solved my problem.
One way to avoid the declaration of the encoding is to use the AddImageUrl() method that I expect to detect the HTML encoding from the HTTP/HTML response.
Encoding meta tag and AddImageURL method perhaps helps with simple document, but not in a chain situation, where encoding somehow gets lost despite encoding tag. I encountered this problem (exactly as described in original question - some foreign characters such as umlauts would disappear), and see no solution. I am considering getting rid of ABCPDF altogether and replace it with SSRS, which can render PDF formats.

Unicode in Content-Disposition header

I am using HttpContext object implemented in HttpHandler child to download a file, when I have non-ascii characters in file name it looks weird in IE whereas it looks fine in Firefox.
below is the code:-
context.Response.ContentType = ".cs";
context.Response.AppendHeader("Content-Length", data.Length.ToString());
context.Response.AppendHeader("Content-Disposition", String.Format("attachment; filename={0}",filename));
context.Response.OutputStream.Write(data, 0, data.Length);
context.Response.Flush();
when I supply 'ß' 'ä' 'ö' 'ü' 'ó' 'ß' 'ä' 'ö' 'ü' 'ó' in file name field it looks different than what I have in file name it looks fine in firefox. adding EncodingType and charset has been of no use.
In ie it is 'ß''ä''ö''ü''ó''ß''ä''ö''ü'_'ó' and in firefox it is 'ß' 'ä' 'ö' 'ü' 'ó' 'ß' 'ä' 'ö' 'ü' 'ó'.
Any Idea how this can be fixed?
I had similar problem. You have to use HttpUtility.UrlEncode or Server.UrlEncode to encode filename. Also I remember firefox didn't need it. Moreoverit ruined filename when it's url-encoded. My code:
// IE needs url encoding, FF doesn't support it, Google Chrome doesn't care
if (Request.Browser.IsBrowser ("IE"))
{
fileName = Server.UrlEncode(fileName);
}
Response.Clear ();
Response.AddHeader ("content-disposition", String.Format ("attachment;filename=\"{0}\"", fileName));
Response.AddHeader ("Content-Length", data.Length.ToString (CultureInfo.InvariantCulture));
Response.ContentType = mimeType;
Response.BinaryWrite(data);
Edit
I have read specification more carefully. First of all RFC2183 states that:
Current [RFC 2045] grammar restricts parameter values (and hence Content-Disposition filenames) to US-ASCII.
But then I found references that [RFC 2045] is absolete and one must reference RFC 2231, which states:
Asterisks ("*") are reused to provide
the indicator that language and
character set information is present
and encoding is being used. A single
quote ("'") is used to delimit the
character set and language information
at the beginning of the parameter
value. Percent signs ("%") are used as
the encoding flag, which agrees with
RFC 2047.
Which means that you can use UrlEncode for non-ascii symbols, as long as you include the encoding as stated in the rfc. Here is an example:
string.Format("attachment; filename=\"{0}\"; filename*=UTF-8''{0}", Server.UrlEncode(fileName, Encoding.UTF8));
Note that filename is included in addition to filename* for backwards compatibility. You can also choose another encoding and modify the parameter accordingly, but UTF-8 covers everything.
HttpUtility.UrlPathEncode might be a better option. As URLEncode will replace spaces with '+' signs.
For me this solution is working on all major browsers:
Response.AppendHeader("Content-Disposition", string.Format("attachment; filename*=UTF-8''{0}", HttpUtility.UrlPathEncode(fileName).Replace(",", "%2C"));
var mime = MimeMapping.GetMimeMapping(fileName);
return File(fileName, mime);
Using ASP.NET MVC 3.
The Replace is necessary, because Chrome doesn't like Comma (,) in parameter values: http://www.gangarasa.com/lets-Do-GoodCode/tag/err_response_headers_multiple_content_disposition/
You may want to read RFC 6266 and look at the tests at http://greenbytes.de/tech/tc2231/.
For me this solved the problem:
var result = new HttpResponseMessage(HttpStatusCode.OK)
{
Content = new ByteArrayContent(data)
};
result.Content.Headers.ContentDisposition = new ContentDispositionHeaderValue("attachment")
{
FileNameStar = "foo-ä-€.html"
};
When i look ad the repsonse in fiddler i can see the filename has automaticcaly been encoded using UTF-8:
Fiddler response example with encoded Content-Disposition filename using UTF-8
If we look at the value of the Content-Disposition header we can see it will be the same as #Johannes Geyer his answer. The only difference is that we didn't have to do the encoding ourselfs, the ContentDispositionHeaderValue class takes care of that.
I used the Testcases for the Content-Disposition header on: http://greenbytes.de/tech/tc2231/ as mentioned by Julian Reschke.
Information about the ContentDispositionHeaderValue class can be found on MSDN.
For Asp.Net Core (version 2 as of this post) UrlPathEncode is deprecated, here's how to achieve the desired result:
System.Net.Mime.ContentDisposition cd = new System.Net.Mime.ContentDisposition
{
FileName = Uri.EscapeUriString(fileName),
Inline = true // false = prompt the user for downloading; true = browser to try to show the file inline
};
Response.Headers.Add("Content-Disposition", cd.ToString());
I`m using Uri.EscapeUriString for converts all characters to their hexadecimal representation, and string.Normalize for Unicode normalization form C.
(tested in ASP.NET MVC5 framework 4.5)
var contentDispositionHeader = new System.Net.Mime.ContentDisposition
{
Inline = false,
FileName = Uri.EscapeUriString(Path.GetFileName(pathFile)).Normalize()
};
Response.Headers.Add("Content-Disposition", contentDispositionHeader.ToString());
string mimeType = MimeMapping.GetMimeMapping(Server.MapPath(pathFile));
return File(file, mimeType);

Categories

Resources