Some special characters are replaced by '?' while generating PDF from HTML - C#

I am trying to generate a PDF from an HTML file using the iTextSharp library, but I have one issue: when I convert the HTML into a PDF, some special characters from the HTML file are replaced by a '?' sign (e.g. €).
Here is my code:
var elements = XMLWorkerHelper.ParseToElementList(html, null);
foreach (var element in elements)
{
    document.Add(element);
}
XMLWorkerHelper is a class from the iTextSharp library.
I just want the generated PDF to look the same as my HTML file.

If you use XMLWorkerHelper.ParseToElementList(String, String) (which you are), then iTextSharp asks the .NET runtime to interpret the contents of the file by calling System.Text.Encoding.Default.GetBytes().
Per the docs, System.Text.Encoding.Default
Gets an encoding for the operating system's current ANSI code page
And further (emphasis mine):
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
So from the above you'll see that, in the absence of any information in the file about how the raw bytes are intended to be interpreted, .NET will just use the local code page to interpret them. What's really fun is that if you move your code 100% exactly as-is to another machine, you might get different results, because that machine might have a different code page set.
The best solution is to avoid code pages completely. To do this, just save the file in a Unicode-compatible format such as UTF-8 and include a BOM to explicitly declare your intentions. The BOM is optional (and frowned upon by some people), but it is also the most explicit option in the absence of other information (such as HTTP headers or post-it notes).
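For example (a sketch, not from the original answer), re-saving the HTML as UTF-8 with a BOM from C# could look like this; the file path is a placeholder:
// Write the HTML out as UTF-8 with a BOM so any reader can detect the encoding.
File.WriteAllText(@"input.html", html, new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));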
The second option is to just re-implement XMLWorkerHelper.ParseToElementList() with your appropriate encoding. SourceForge is apparently down right now so here's the body of that method:
/**
 * Parses an HTML string and a string containing CSS into a list of Element objects.
 * The FontProvider will be obtained from iText's FontFactory object.
 *
 * @param html a String containing an XHTML snippet
 * @param css a String containing CSS
 * @return an ElementList instance
 */
public static ElementList ParseToElementList(String html, String css) {
    // CSS
    ICSSResolver cssResolver = new StyleAttrCSSResolver();
    if (css != null) {
        ICssFile cssFile = XMLWorkerHelper.GetCSS(new MemoryStream(Encoding.Default.GetBytes(css)));
        cssResolver.AddCss(cssFile);
    }

    // HTML
    CssAppliers cssAppliers = new CssAppliersImpl(FontFactory.FontImp);
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
    htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
    htmlContext.AutoBookmark(false);

    // Pipelines
    ElementList elements = new ElementList();
    ElementHandlerPipeline end = new ElementHandlerPipeline(elements, null);
    HtmlPipeline htmlPipeline = new HtmlPipeline(htmlContext, end);
    CssResolverPipeline cssPipeline = new CssResolverPipeline(cssResolver, htmlPipeline);

    // XML Worker
    XMLWorker worker = new XMLWorker(cssPipeline, true);
    XMLParser p = new XMLParser(worker);
    p.Parse(new MemoryStream(Encoding.Default.GetBytes(html)));
    return elements;
}
The second-to-last line of code, the one that starts with p.Parse, is what you'd want to change. Since we don't know what the bytes of your file are (and neither does your computer, apparently), we can't tell you what to switch the encoder to.
Just to wrap up, this actually isn't an iTextSharp problem at all; it is the default behavior of the .NET runtime. iTextSharp is just using the system default in the absence of other information.
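For illustration (not part of the original answer), a minimal sketch of that change, assuming your HTML file is actually saved as UTF-8; the file path is a placeholder:
// Read the HTML with the encoding it was actually saved in ("input.html" is a placeholder path).
string html = File.ReadAllText(@"input.html", Encoding.UTF8);

// In your copy of ParseToElementList, hand the parser UTF-8 bytes instead of Encoding.Default:
p.Parse(new MemoryStream(Encoding.UTF8.GetBytes(html)));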

p.Parse(new StringReader(html));
This worked for me.

Related

NReco HTML-to-PDF Generator GeneratePdfFromFiles method throws exception

I have a fully working system for creating single-page PDFs from HTML, as below.
After initializing the converter
var nRecoHTMLToPDFConverter = new HtmlToPdfConverter();
nRecoHTMLToPDFConverter = PDFGenerator.PDFSettings(nRecoHTMLToPDFConverter);
string PDFContents;
PDFContents is an HTML string which is being populated.
The following command works perfectly and gives me the byte[] which I can return:
createDTO.PDFContent = nRecoHTMLToPDFConverter.GeneratePdf(PDFContents);
The problem arises when I want to test and develop the multi-page functionality of the NReco library and convert an arbitrary number of HTML pages to PDF pages.
var stringArray = new string[]
{
    PDFContents, PDFContents,
};
var stream = new MemoryStream();
nRecoHTMLToPDFConverter.GeneratePdfFromFiles(stringArray, null, stream);
var mybyteArray = stream.ToArray();
The PDFContents are exactly the same as above. On paper, this should give me the byte array for 2 identical PDF pages; however, on calling the GeneratePdfFromFiles method, I get the following exception:
WkHtmlToPdfException: Exit with code 1 due to network error: HostNotFoundError (exit code: 1)
Please help me resolve this if you have experience with this library and its complexities. I have a feeling that I'm not familiar with the proper use of a Stream object in this scenario. I've tested the working single-page line and the malfunctioning multi-page lines in the same method call, so their context is identical.
Many thanks
The GeneratePdfFromFiles method you used expects an array of file names (or URLs): https://www.nrecosite.com/doc/NReco.PdfGenerator/?topic=html/M_NReco_PdfGenerator_HtmlToPdfConverter_GeneratePdfFromFiles_1.htm
If you operate with HTML content as .NET strings, you may simply save it to temp files, generate the PDF, and remove them after that.
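For illustration, here is a minimal sketch of that approach (PDFContents, nRecoHTMLToPDFConverter and createDTO come from the question; the temp-file handling is just one possible way to do it):
// Write each HTML string to a temporary .html file, since GeneratePdfFromFiles expects file paths (or URLs).
var htmlPages = new[] { PDFContents, PDFContents };
var tempFiles = new string[htmlPages.Length];
for (int i = 0; i < htmlPages.Length; i++)
{
    tempFiles[i] = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".html");
    File.WriteAllText(tempFiles[i], htmlPages[i], Encoding.UTF8);
}

try
{
    using (var stream = new MemoryStream())
    {
        nRecoHTMLToPDFConverter.GeneratePdfFromFiles(tempFiles, null, stream);
        createDTO.PDFContent = stream.ToArray();
    }
}
finally
{
    // Clean up the temporary files once the PDF has been generated.
    foreach (var path in tempFiles)
        File.Delete(path);
}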

How do I extract actual font names from a PDF with iTextSharp?

I am using iTextSharp for PDF processing, and I need to extract all text from an existing PDF that is written in a certain font.
A way to do that is to inherit from a RenderFilter and only allow text that has a certain PostscriptFontName. The problem is that when I do this, I see the following font names in the PDF:
CIDFont+F1
CIDFont+F2
CIDFont+F3
CIDFont+F4
CIDFont+F5
which is nothing like the actual font names I am looking for.
I have tried enumerating the font resources, and it shows the same result.
I have tried opening the PDF in the full Adobe Acrobat. It also shows the mangled font names.
I have tried analysing the file with iText RUPS. Same result.
That is, I have not been able to see the actual font names anywhere in the document structure.
Yet, Adobe Acrobat DC does show the correct font names in the Format pane when I select various text boxes on the document canvas (e.g. Arial, Courier New, Roboto), so that information must be stored somewhere.
How do I get those real font names when parsing PDFs with iTextSharp?
As determined in the course of the comments to the question, the font names are anonymized in all PDF metadata for the font, but the embedded font program itself contains the actual font name.
(So the PDF, strictly speaking, is broken, even though in a way that hardly any software will ever complain about.)
If we want to retrieve those names, therefore, we have to look inside these font programs.
Here is a proof of concept following the architecture used in the answer you referenced, i.e. using a RenderFilter:
class FontProgramRenderFilter : RenderFilter
{
    public override bool AllowText(TextRenderInfo renderInfo)
    {
        DocumentFont font = renderInfo.GetFont();
        PdfDictionary fontDict = font.FontDictionary;
        PdfName subType = fontDict.GetAsName(PdfName.SUBTYPE);
        if (PdfName.TYPE0.Equals(subType))
        {
            PdfArray descendantFonts = fontDict.GetAsArray(PdfName.DESCENDANTFONTS);
            PdfDictionary descendantFont = descendantFonts[0] as PdfDictionary;
            PdfDictionary fontDescriptor = descendantFont.GetAsDict(PdfName.FONTDESCRIPTOR);
            PdfStream fontStream = fontDescriptor.GetAsStream(PdfName.FONTFILE2);
            byte[] fontData = PdfReader.GetStreamBytes((PRStream)fontStream);

            MemoryStream dataStream = new MemoryStream(fontData);
            dataStream.Position = 0;
            MemoryPackage memoryPackage = new MemoryPackage();
            Uri uri = memoryPackage.CreatePart(dataStream);
            GlyphTypeface glyphTypeface = new GlyphTypeface(uri);
            memoryPackage.DeletePart(uri);

            ICollection<string> names = glyphTypeface.FamilyNames.Values;
            return names.Where(name => name.Contains("Arial")).Count() > 0;
        }
        else
        {
            // analogous code for other font subtypes
            return false;
        }
    }
}
The MemoryPackage class is from this answer, which was my first find when searching for how to read information from a font in memory using .NET.
Applied to your PDF file like this:
using (PdfReader pdfReader = new PdfReader(SOURCE))
{
    FontProgramRenderFilter fontFilter = new FontProgramRenderFilter();
    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), fontFilter);
    Console.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy));
}
the result is
This is Arial.
Beware: This is a mere proof of concept.
On one hand you will surely also need to implement the part commented as analogous code for other font subtypes above; and even the TYPE0 part is not ready for production use as it only considers FONTFILE2 and does not handle null values gracefully.
On the other hand you will want to cache names for fonts already inspected.
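For example, a rough caching sketch (not from the original answer; InspectFontProgram is a hypothetical helper wrapping the TYPE0 inspection shown above):
// Cache the allow/deny decision per font, keyed on the (anonymized but unique) PostScript
// name from the PDF, so each embedded font program is only parsed once.
private readonly Dictionary<string, bool> allowCache = new Dictionary<string, bool>();

public override bool AllowText(TextRenderInfo renderInfo)
{
    DocumentFont font = renderInfo.GetFont();
    bool allowed;
    if (!allowCache.TryGetValue(font.PostscriptFontName, out allowed))
    {
        allowed = InspectFontProgram(font.FontDictionary); // hypothetical helper containing the logic above
        allowCache[font.PostscriptFontName] = allowed;
    }
    return allowed;
}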

C# Web scraper copying text

I have a web scraper written in C# for extracting data. I want to copy text from the web browser control and paste it into a Word file programmatically. When I try to extract rich text box content using its ID and InnerText, the text contains encoded characters like %2c.
I need to get the text with all its formatting, but I can't find any way to do this. I have tried Encoding, HttpUtility.UrlDecode, SendKeys and elem.InvokeMember() without success.
How can I programmatically copy and paste text from the web browser control while preserving its formatting?
Here is the sample data to extract:
Description
The Advance Concepts Engineering team designs and develops new vehicles which will meet future regulatory requirements and customer competitive requirements. A qualified candidate will be responsible for the total vehicle packaging. The candidate will identify and resolve adaptation and packaging issues as the vehicle moves toward production. They will lead cross functional team meetings working with Systems & Components, Advance Manufacturing, Service, etc. to ensure that the solutions are optimized for all stages of the vehicle's life.
HtmlElement elem = wb.Document.GetElementById("ctl00_contplhDynamic_txtDescrContentHiddenTextarea");
if (elem == null) return;
elem.InvokeMember("Click");
//elem.InvokeMember("Select All");
//elem.InvokeMember("Copy");
SendKeys.SendWait("^a");
SendKeys.SendWait("^c");
Clipboard.Clear();
elem.Focus();
elem.InvokeMember("Right Click");
elem.InvokeMember("Select All");
elem.InvokeMember("Copy");
Clipboard.SetText(elem.InnerText);
string clipbrdText = Clipboard.GetText();
string data = elem.InnerText;
richTextBox1.Text = data;
string temp = System.Web.HttpUtility.UrlDecode(data);
Encoding iso = Encoding.GetEncoding("windows-1252");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(data);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string msg = iso.GetString(isoBytes);
The text with "%2c" etc. has been encoded. If you are getting the content of a web page, you should be decoding the HTML, not the URL. You can use HttpUtility.HtmlDecode, or if you are using .NET 4.0 or above, you can also use WebUtility.HtmlDecode - this is available in the System.Net namespace.
You should note that Word does not use HTML for its formatting, so you won't be able to paste HTML tags and expect it to recognise them; i.e. <strong>Description</strong> will not result in bold text if you type that into Word.
EDIT:
It looks like you are mixing two different ways to copy the text in the code you pasted - both SendKeys.SendWait("^c"); and elem.InvokeMember("Copy");. I presume both of these methods work?
I think the problem you are having lies in the way you are getting the text. I see you're using Clipboard.GetText() to get the text. Try specifying that it is formatted text using Clipboard.GetText(TextDataFormat.Rtf) or Clipboard.GetText(TextDataFormat.Html). This should hopefully copy the string preserving the formatting.
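For illustration, a small sketch of both suggestions (elem is the HtmlElement from the question, and the clipboard is assumed to already contain the copied selection):
// Decode HTML entities such as &amp; or &#44; rather than URL-decoding.
string decoded = System.Net.WebUtility.HtmlDecode(elem.InnerText);

// Read the clipboard as formatted content instead of plain text.
string htmlClip = Clipboard.GetText(TextDataFormat.Html);
string rtfClip = Clipboard.GetText(TextDataFormat.Rtf);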

How do I add JPEG comment (COM) to an image?

I'm trying to add a JPEG comment to an image file using WPF. The following code throws an ArgumentOutOfRangeException. Setting other properties works without problems.
using (Stream read = File.OpenRead(@"my.jpeg"))
{
    JpegBitmapDecoder decoder = new JpegBitmapDecoder(read, BitmapCreateOptions.None, BitmapCacheOption.None);
    var meta = decoder.Frames[0].Metadata.Clone() as BitmapMetadata;
    meta.SetQuery("/app1/ifd/exif:{uint=40092}", "xxx"); // works
    meta.SetQuery("/com/TextEntry", "xxx"); // does not work
}
To be clear: I have to set the /com/TextEntry field, which is listed on MSDN: http://msdn.microsoft.com/en-us/library/windows/desktop/ee719904%28v=vs.85%29.aspx#_jpeg_metadata
The data is read by another application which only supports this tag, so it is not an option to use other "comment" fields.
Any ideas?
The data type for /com/TextEntry is a bit tricky: it requires an LPSTR, which is a pointer to a raw 8-bit encoded string. You can provide this by passing a char[] for the argument. Fix:
meta.SetQuery("/com/TextEntry", "xxx".ToCharArray());
Do note that text encoding might be an issue if you use non-ASCII characters; you'll get text encoded in the machine's default code page (Encoding.Default).
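For completeness, a hedged sketch of applying the fix and writing the result back out (the file names are placeholders; note that re-encoding a JPEG this way is lossy):
using (Stream read = File.OpenRead(@"my.jpeg"))
{
    var decoder = new JpegBitmapDecoder(read, BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.OnLoad);
    BitmapFrame frame = decoder.Frames[0];

    var meta = frame.Metadata.Clone() as BitmapMetadata;
    meta.SetQuery("/com/TextEntry", "my comment".ToCharArray()); // pass a char[] so it is written as an LPSTR

    var encoder = new JpegBitmapEncoder();
    encoder.Frames.Add(BitmapFrame.Create(frame, frame.Thumbnail, meta, frame.ColorContexts));

    using (Stream write = File.Create(@"my_commented.jpeg"))
    {
        encoder.Save(write);
    }
}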

ABCpdf 5 Problems with encoding (special characters)

I am using ABCpdf Version 5 in order to render some HTML pages into PDFs.
I basically use the HttpServerUtility.Execute() method in order to retrieve the HTML for the PDF:
System.IO.StringWriter writer = new System.IO.StringWriter();
server.Execute(requestUrl, writer);
string pageResult = writer.ToString();
WebSupergoo.ABCpdf5.Doc pdfDoc = new WebSupergoo.ABCpdf5.Doc();
pdfDoc.AddImageHtml(pageResult);
response.Buffer = false;
response.ContentType = "application/pdf";
response.AddHeader("Content-Disposition", "attachment;filename=MyPdf_" +
    FormatDate(DateTime.Now, "yyyy-MM-dd") + ".pdf");
response.BinaryWrite(pdfDoc.GetData());
Now some special characters like umlauts (äöü) are replaced with an empty space. Interestingly, not all of them. What I did figure out:
Within the HTML page I have:
`<meta http-equiv="content-type" content="text/xhtml; charset=utf-8" />`
If I strip this out, all special characters are rendered correctly, but this seems like an ugly hack to me.
In earlier days I did not use HttpServerUtility.Execute(); instead I let ABCpdf call the URL itself: pdfDoc.AddImageUrl("someUrl");. With that I had no such encoding problems.
What else could I try?
Just came across this problem with ABCpdf 8.
In your code you retrieve the HTML contents and pass pageResult to AddImageHtml(). As the documentation states:
ABCpdf saves this HTML into a temporary file and renders the file
using a 'file://' protocol specifier.
What is not mentioned is that the temp file is UTF-8 encoded, but the encoding is not stated in the HTML file.
The <meta> tag actually sets the required encoding, and it solved my problem.
One way to avoid having to declare the encoding is to use the AddImageUrl() method, which I expect detects the HTML encoding from the HTTP/HTML response.
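For illustration (not from the original answer), one way to make sure the HTML handed to AddImageHtml() declares its encoding; the naive Replace on <head> is only a sketch:
// Ensure the snippet declares UTF-8 before ABCpdf writes it to a temp file.
string htmlWithCharset = pageResult.Contains("charset=")
    ? pageResult
    : pageResult.Replace("<head>",
        "<head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\" />");
pdfDoc.AddImageHtml(htmlWithCharset);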
The encoding meta tag and the AddImageUrl method perhaps help with a simple document, but not in a chained situation, where the encoding somehow gets lost despite the encoding tag. I encountered this problem (exactly as described in the original question - some foreign characters such as umlauts would disappear) and see no solution. I am considering getting rid of ABCpdf altogether and replacing it with SSRS, which can render PDF formats.
