I am using ABCpdf Version 5 to render some HTML pages into PDFs.
I basically use the HttpServerUtility.Execute() method to retrieve the HTML for the PDF:
System.IO.StringWriter writer = new System.IO.StringWriter();
server.Execute(requestUrl, writer);
string pageResult = writer.ToString();
WebSupergoo.ABCpdf5.Doc pdfDoc = new WebSupergoo.ABCpdf5.Doc();
pdfDoc.AddImageHtml(pageResult);
response.Buffer = false;
response.ContentType = "application/pdf";
response.AddHeader("Content-Disposition", "attachment;filename=MyPdf_" +
FormatDate(DateTime.Now, "yyyy-MM-dd") + ".pdf");
response.BinaryWrite(pdfDoc.GetData());
Now some special characters such as umlauts (äöü) are replaced with an empty space. Interestingly, not all of them. Here is what I did figure out:
Within the HTML page I have:
`<meta http-equiv="content-type" content="text/xhtml; charset=utf-8" />`
If I strip this tag out, all special characters are rendered correctly, but that seems like an ugly hack to me.
Previously I did not use HttpServerUtility.Execute(); instead I let ABCpdf fetch the URL itself via pdfDoc.AddImageUrl("someUrl");, and there I had no such encoding problems.
What else could I try?
Just came across this problem with ABCpdf 8.
In your code you retrieve the HTML contents and pass pageResult to AddImageHtml(). As the documentation states:
ABCpdf saves this HTML into a temporary file and renders the file
using a 'file://' protocol specifier.
What is not mentioned is that the temp file is UTF-8 encoded, but the encoding is not stated in the HTML file.
The <meta> tag actually supplies the required encoding declaration, and adding it solved my problem.
One way to avoid declaring the encoding in the markup is to use the AddImageUrl() method, which I expect detects the HTML encoding from the HTTP/HTML response.
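For completeness, a minimal sketch of the first approach, assuming your HTML contains a literal <head> tag and the string really holds UTF-8-compatible text (variable names taken from the question):

// Sketch only: make sure the HTML carries its own encoding declaration
// before ABCpdf writes it to a temp file and renders it via file://.
if (pageResult.IndexOf("charset=", StringComparison.OrdinalIgnoreCase) < 0)
{
    pageResult = pageResult.Replace("<head>",
        "<head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\" />");
}
WebSupergoo.ABCpdf5.Doc pdfDoc = new WebSupergoo.ABCpdf5.Doc();
pdfDoc.AddImageHtml(pageResult);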
The encoding meta tag and the AddImageUrl() method may help with a simple document, but not in a chained situation, where the encoding somehow gets lost despite the tag. I encountered this problem (exactly as described in the original question: some foreign characters such as umlauts would disappear) and see no solution. I am considering dropping ABCpdf altogether and replacing it with SSRS, which can render to PDF.
I'm adding a link to a file from a PDF document (created with iText)
this way:
Chunk chunk = new Chunk(fileName, font);
chunk.SetAnchor("./relative/path/to/file");
The link works great if I open the document in Google Chrome or Adobe Reader.
But it doesn't work if I open my PDF in Microsoft Edge.
Is it even possible to create a file link inside a PDF with iText that will work in Microsoft Edge? If so, how?
Having done some tests, it appears that Edge does not support relative links in PDF documents.
It does support absolute links, though, given the full URI, e.g.
chunk = new Chunk("Only ASCII chars in target. Full path.");
chunk.SetAnchor("file:///C:/Repo/GitHub/testarea/itext5/target/test-outputs/annotate/Attachments/1.png");
doc.Add(new Paragraph(chunk));
In contrast to other PDF viewers (Adobe Reader, Chrome, cf. your previous question in this context) it does not support URL encoding of special characters like Cyrillic ones:
chunk = new Chunk("Cyrillic chars in target. URL-encoded. Full path. NOT WORKING");
chunk.SetAnchor("file:///C:/Repo/GitHub/testarea/itext5/target/test-outputs/annotate/" + WebUtility.UrlEncode("Вложения") + "/1.png");
doc.Add(new Paragraph(chunk));
But it does support the special characters in UTF-8 encoding. As UTF-8 PdfString encoding is a PDF-2.0 feature and iText 5 does not support PDF-2.0, one has to cheat a bit to inject strings in UTF-8 encoding here:
chunk = new Chunk("Cyrillic chars in target. Action manipulated. Full path.");
chunk.SetAnchor("XXX");
action = (PdfAction)chunk.Attributes[Chunk.ACTION];
action.Put(PdfName.URI, new PdfString(new UTF8Encoding().GetBytes("file:///C:/Repo/GitHub/testarea/itext5/target/test-outputs/annotate/Вложения/1.png")));
doc.Add(new Paragraph(chunk));
Tested with Edge 41.16299.666.0
I am trying to generate a PDF from an HTML file using the iTextSharp library, but I have one issue: when I convert the HTML into a PDF, some special characters from the HTML file are replaced by a '?' sign (e.g. €).
Here is my code:
var elements = XMLWorkerHelper.ParseToElementList(html, null);
foreach (var element in elements)
{
    document.Add(element);
}
XMLWorkerHelper is a class from the iTextSharp library.
I just want my PDF to be generated the same as my HTML file.
If you use XMLWorkerHelper.ParseToElementList(String, String) (which you are) then iTextSharp is going to ask the .Net runtime to figure out the contents of the file by calling System.Text.Encoding.Default.GetBytes().
Per the docs, System.Text.Encoding.Default
Gets an encoding for the operating system's current ANSI code page
And further (emphasis mine):
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
So from the above you'll see that in the absence of any information in the file about how the raw bytes are intended to be interpreted, .Net will just use the local code page to interpret them. What's really fun is if you move your code 100% exactly as-is to another machine you might get different results because that machine might have a different code page set.
The best solution is to avoid code pages completely. To do this, just save the file as Unicode compatible format such as UTF8 and include a BOM to explicitly declare your intentions. The BOM is optional (and frowned upon by some people) but it is also the most explicit way in the absence of other information (such as HTTP headers or post-it notes).
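As a minimal sketch of that first option, assuming you control how the HTML file gets written (htmlPath and htmlContent are placeholder names, not from the question):

// new UTF8Encoding(true) makes the writer emit the EF BB BF preamble (BOM),
// so later readers can detect the encoding instead of guessing.
// Requires System.IO and System.Text.
File.WriteAllText(htmlPath, htmlContent, new UTF8Encoding(true));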
The second option is to just re-implement XMLWorkerHelper.ParseToElementList() with your appropriate encoding. SourceForge is apparently down right now so here's the body of that method:
/**
* Parses an HTML string and a string containing CSS into a list of Element objects.
* The FontProvider will be obtained from iText's FontFactory object.
*
* @param html a String containing an XHTML snippet
* @param css a String containing CSS
* @return an ElementList instance
*/
public static ElementList ParseToElementList(String html, String css) {
    // CSS
    ICSSResolver cssResolver = new StyleAttrCSSResolver();
    if (css != null) {
        ICssFile cssFile = XMLWorkerHelper.GetCSS(new MemoryStream(Encoding.Default.GetBytes(css)));
        cssResolver.AddCss(cssFile);
    }
    // HTML
    CssAppliers cssAppliers = new CssAppliersImpl(FontFactory.FontImp);
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
    htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
    htmlContext.AutoBookmark(false);
    // Pipelines
    ElementList elements = new ElementList();
    ElementHandlerPipeline end = new ElementHandlerPipeline(elements, null);
    HtmlPipeline htmlPipeline = new HtmlPipeline(htmlContext, end);
    CssResolverPipeline cssPipeline = new CssResolverPipeline(cssResolver, htmlPipeline);
    // XML Worker
    XMLWorker worker = new XMLWorker(cssPipeline, true);
    XMLParser p = new XMLParser(worker);
    p.Parse(new MemoryStream(Encoding.Default.GetBytes(html)));
    return elements;
}
The second-to-last line of code, the one that starts with p.Parse, is what you'd want to change. Since we don't know what the bytes of your file are (and neither does your computer, apparently), we can't tell you what to switch the encoder over to.
Just to wrap up: this actually isn't an iTextSharp problem at all, it's the default behavior of the .Net runtime. iTextSharp is just using the system default in the absence of other information.
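For example, if you know the HTML string actually holds UTF-8-encodable text (an assumption only you can verify), the swapped-in line would look like this:

// Explicit encoding instead of the machine-dependent Encoding.Default:
p.Parse(new MemoryStream(Encoding.UTF8.GetBytes(html)));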
p.Parse(new StringReader(html));
This worked for me.
On a web page there is the following string:
"Qualcomm Snapdragon™ S4"
When I get this string in my .NET code, it turns into "Qualcomm Snapdragonâ„¢ S4".
The character "™" becomes "â„¢".
How can I decode "â„¢" back to "™"?
Update
Following is the code that downloads the string (wc is a WebClient):
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8");
string html = Server.HtmlEncode(wc.DownloadString(url));
You should read the webpage in its proper encoding in the first place. In this case it seems you are reading with Encoding.Default (i.e. probably CP1252) and the page is really in UTF-8. This should be apparent either by reading the Content-Type header of the response or by looking for a <meta http-equiv="Content-Type" content='text/html; charset=utf-8'> in the content.
If you still need to do this after the fact, then use
var bytes = Encoding.Default.GetBytes(myString);
var correctString = Encoding.UTF8.GetString(bytes);
In any case you would need to know the exact encodings that were used on the page and for reading the malformed string in the first place. Furthermore I'd generally advise explicitly against using Encoding.Default because its value isn't fixed. It's just the legacy encoding on a Windows system for use in non-Unicode applications and also gets used as the default non-Unicode text file encoding. It should have no place whatsoever in handling external resources.
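A minimal sketch of reading the page correctly up front, assuming the page really is UTF-8 (url as in the question; requires System.Net and System.Text):

using (var wc = new WebClient())
{
    // DownloadString decodes the response bytes with this encoding when the
    // server doesn't state a charset, avoiding the CP1252 fallback.
    wc.Encoding = Encoding.UTF8;
    string html = wc.DownloadString(url);
}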
I'm generating HTML from a database and then sending it to PrinceXML for conversion to PDF. The code I use to do this is:
string _htmlTemplate = @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html lang=""en-GB"" xml:lang=""en-GB"" xmlns=""http://www.w3.org/1999/xhtml""><head><meta http-equiv=""Content-type"" content=""text/html;charset=UTF-8"" /><title>Generated PDF Contract</title></head><body>{0}</body></html>";
string _pgeContent = string.Format(_htmlTemplate, sb.ToString());
writer.Write(sb.ToString());
Byte[] arrBytes = UTF8Encoding.Default.GetBytes(_pgeContent);
Stream s = new MemoryStream(arrBytes);
Prince princeConverter = new Prince(ConfigurationManager.AppSettings["PrinceXMLInstallLoc"].ToString());
princeConverter.SetLog(ConfigurationManager.AppSettings["PrinceXMLLogLoc"]);
princeConverter.AddStyleSheet(Server.MapPath(ConfigurationManager.AppSettings["FormsDocGenCssLocl"]));
Response.ClearContent();
Response.ClearHeaders();
Response.ContentType = "application/pdf";
Response.BufferOutput = true;
However, conversion fails with the error:
Input is not proper UTF-8, indicate encoding ! Bytes: 0xA0 0x77 0x65 0x62
I've taken the generated html and uploaded it to the W3C validator. It validates the markup as UTF-8 encoded XHTML 1.0 Transitional with no errors or warnings.
I've also gone through the file with a fine-tooth comb looking for invalid characters. So far, nothing.
Can anyone suggest something else I could try?
Well after an afternoon of muttering curses and tearing out what is left of my hair, I figured out a fix for my particular problem.
It would appear that System.Text.UTF8Encoding doesn't output a UTF-8 identifier (byte order mark) by default, so in my case I needed to use the constructor that takes a boolean parameter controlling this.
UTF8Encoding u8enc = new UTF8Encoding(true);//Ensures a UTF8 identifier is emitted.
After this it was all good. Hope this helps someone :-)
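One caveat, as a hedged note: the constructor flag only affects GetPreamble() and writers that honor it; GetBytes() itself never prepends the identifier. So if you assemble the byte array by hand as in the question, a sketch like this makes the BOM explicit:

UTF8Encoding u8enc = new UTF8Encoding(true);
byte[] preamble = u8enc.GetPreamble();   // EF BB BF
byte[] body = u8enc.GetBytes(_pgeContent);
// Concatenate BOM + content so the stream really starts with the identifier.
MemoryStream s = new MemoryStream();
s.Write(preamble, 0, preamble.Length);
s.Write(body, 0, body.Length);
s.Position = 0;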
I am using the HttpContext object in an IHttpHandler implementation to download a file. When the file name contains non-ASCII characters it looks weird in IE, whereas it looks fine in Firefox.
Below is the code:
context.Response.ContentType = ".cs";
context.Response.AppendHeader("Content-Length", data.Length.ToString());
context.Response.AppendHeader("Content-Disposition", String.Format("attachment; filename={0}",filename));
context.Response.OutputStream.Write(data, 0, data.Length);
context.Response.Flush();
When I supply 'ß' 'ä' 'ö' 'ü' 'ó' 'ß' 'ä' 'ö' 'ü' 'ó' in the file name field, the downloaded name in IE differs from what I supplied, while it looks fine in Firefox. Adding an encoding type and charset has been of no use.
In IE it is 'ß''ä''ö''ü''ó''ß''ä''ö''ü'_'ó' and in Firefox it is 'ß' 'ä' 'ö' 'ü' 'ó' 'ß' 'ä' 'ö' 'ü' 'ó'.
Any idea how this can be fixed?
I had a similar problem. You have to use HttpUtility.UrlEncode or Server.UrlEncode to encode the filename. As I remember, Firefox didn't need it; moreover, it mangled the filename when it was URL-encoded. My code:
// IE needs url encoding, FF doesn't support it, Google Chrome doesn't care
if (Request.Browser.IsBrowser("IE"))
{
    fileName = Server.UrlEncode(fileName);
}
Response.Clear();
Response.AddHeader("content-disposition", String.Format("attachment;filename=\"{0}\"", fileName));
Response.AddHeader("Content-Length", data.Length.ToString(CultureInfo.InvariantCulture));
Response.ContentType = mimeType;
Response.BinaryWrite(data);
Edit
I have read the specifications more carefully. First of all, RFC 2183 states that:
Current [RFC 2045] grammar restricts parameter values (and hence Content-Disposition filenames) to US-ASCII.
But then I found references saying that [RFC 2045] is obsolete and one must reference RFC 2231 instead, which states:
Asterisks ("*") are reused to provide
the indicator that language and
character set information is present
and encoding is being used. A single
quote ("'") is used to delimit the
character set and language information
at the beginning of the parameter
value. Percent signs ("%") are used as
the encoding flag, which agrees with
RFC 2047.
This means that you can use UrlEncode for non-ASCII symbols, as long as you include the encoding as stated in the RFC. Here is an example:
string.Format("attachment; filename=\"{0}\"; filename*=UTF-8''{0}", Server.UrlEncode(fileName, Encoding.UTF8));
Note that filename is included in addition to filename* for backwards compatibility. You can also choose another encoding and modify the parameter accordingly, but UTF-8 covers everything.
HttpUtility.UrlPathEncode might be a better option, as UrlEncode replaces spaces with '+' signs.
For me this solution is working on all major browsers:
Response.AppendHeader("Content-Disposition", string.Format("attachment; filename*=UTF-8''{0}", HttpUtility.UrlPathEncode(fileName).Replace(",", "%2C"));
var mime = MimeMapping.GetMimeMapping(fileName);
return File(fileName, mime);
Using ASP.NET MVC 3.
The Replace is necessary because Chrome doesn't like commas (,) in parameter values: http://www.gangarasa.com/lets-Do-GoodCode/tag/err_response_headers_multiple_content_disposition/
You may want to read RFC 6266 and look at the tests at http://greenbytes.de/tech/tc2231/.
For me this solved the problem:
var result = new HttpResponseMessage(HttpStatusCode.OK)
{
    Content = new ByteArrayContent(data)
};
result.Content.Headers.ContentDisposition = new ContentDispositionHeaderValue("attachment")
{
    FileNameStar = "foo-ä-€.html"
};
When I look at the response in Fiddler I can see the filename has automatically been encoded using UTF-8:
[Fiddler screenshot: response with the Content-Disposition filename encoded as UTF-8]
If we look at the value of the Content-Disposition header, we can see it is the same as in Johannes Geyer's answer. The only difference is that we didn't have to do the encoding ourselves; the ContentDispositionHeaderValue class takes care of that.
I used the test cases for the Content-Disposition header at http://greenbytes.de/tech/tc2231/, as mentioned by Julian Reschke.
Information about the ContentDispositionHeaderValue class can be found on MSDN.
For ASP.NET Core (version 2 as of this post) UrlPathEncode is deprecated; here's how to achieve the desired result:
System.Net.Mime.ContentDisposition cd = new System.Net.Mime.ContentDisposition
{
    FileName = Uri.EscapeUriString(fileName),
    Inline = true // false = prompt the user to download; true = let the browser try to show the file inline
};
Response.Headers.Add("Content-Disposition", cd.ToString());
I'm using Uri.EscapeUriString to convert all characters to their hexadecimal representation, and string.Normalize for Unicode normalization form C.
(Tested in ASP.NET MVC 5 on .NET Framework 4.5.)
var contentDispositionHeader = new System.Net.Mime.ContentDisposition
{
    Inline = false,
    FileName = Uri.EscapeUriString(Path.GetFileName(pathFile)).Normalize()
};
Response.Headers.Add("Content-Disposition", contentDispositionHeader.ToString());
string mimeType = MimeMapping.GetMimeMapping(Server.MapPath(pathFile));
return File(pathFile, mimeType);