NReco HTML-to-PDF Generator GeneratePdfFromFiles method throws exception - c#

I have a fully working system for creating single-page PDFs from HTML, as below.
After initializing the converter:
var nRecoHTMLToPDFConverter = new HtmlToPdfConverter();
nRecoHTMLToPDFConverter = PDFGenerator.PDFSettings(nRecoHTMLToPDFConverter);
string PDFContents;
PDFContents is an HTML string which gets populated.
The following call works perfectly and gives me the byte[] which I can return:
createDTO.PDFContent = nRecoHTMLToPDFConverter.GeneratePdf(PDFContents);
The problem arises when I want to test and develop the multi-page functionality of the NReco library and convert an arbitrary number of HTML pages into PDF pages.
var stringArray = new string[]
{
PDFContents, PDFContents,
};
var stream = new MemoryStream();
nRecoHTMLToPDFConverter.GeneratePdfFromFiles(stringArray, null, stream);
var mybyteArray = stream.ToArray();
PDFContents is exactly the same as above. On paper, this should give me the byte array for two identical PDF pages; however, the call to the GeneratePdfFromFiles method throws the following exception:
WkHtmlToPdfException: Exit with code 1 due to network error: HostNotFoundError (exit code: 1)
Please help me resolve this if you have experience with this library and its complexities. I have a feeling that I'm not familiar with the proper use of a Stream object in this scenario. I've tested the working single-page line and the malfunctioning multi-page lines in the same method call, so their contexts are identical.
Many thanks

The GeneratePdfFromFiles method you used expects an array of file names (or URLs): https://www.nrecosite.com/doc/NReco.PdfGenerator/?topic=html/M_NReco_PdfGenerator_HtmlToPdfConverter_GeneratePdfFromFiles_1.htm
If you operate on HTML content as .NET strings, you can simply save each string to a temp file, generate the PDF, and delete the files afterwards.
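A minimal sketch of that temp-file round trip, assuming the NReco.PdfGenerator package; the HTML content here is a placeholder:

```csharp
using System;
using System.IO;
using System.Linq;
using NReco.PdfGenerator; // assumes the NReco.PdfGenerator NuGet package

class MultiPagePdfSketch
{
    static void Main()
    {
        var converter = new HtmlToPdfConverter();
        string pdfContents = "<html><body><h1>Page</h1></body></html>"; // placeholder HTML
        var htmlPages = new[] { pdfContents, pdfContents };

        // GeneratePdfFromFiles wants file paths, not HTML strings,
        // so write each string to a temporary .html file first.
        string[] tempFiles = htmlPages
            .Select(_ => Path.ChangeExtension(Path.GetTempFileName(), ".html"))
            .ToArray();
        try
        {
            for (int i = 0; i < htmlPages.Length; i++)
                File.WriteAllText(tempFiles[i], htmlPages[i]);

            using (var stream = new MemoryStream())
            {
                // null = no cover page; the stream receives the combined PDF
                converter.GeneratePdfFromFiles(tempFiles, null, stream);
                byte[] myByteArray = stream.ToArray(); // one PDF, one page per file
            }
        }
        finally
        {
            foreach (var f in tempFiles)
                File.Delete(f); // clean up the temp files
        }
    }
}
```

Writing to files and deleting them in a finally block keeps the temp directory clean even if the conversion throws.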

Related

iTextSharp exception: Rebuild failed: Dictionary key R is not a name. at file pointer 9195

I have a Windows Service that uses iTextSharp version 5.5.11 to read PDFs out of the DB, pull their text, and upload that text back to the DB for easy searching.
It has worked quite well. But now, it seems every PDF that it opens throws the same exception:
iTextSharp.text.exceptions.InvalidPdfException: Rebuild failed: Dictionary key R is not a name. at file pointer 9195; Original message: Dictionary key R is not a name. at file pointer 9195
at iTextSharp.text.pdf.PdfReader..ctor(IRandomAccessSource byteSource, Boolean partialRead, Byte[] ownerPassword, X509Certificate certificate, ICipherParameters certificateKey, Boolean closeSourceOnConstructorError)
at iTextSharp.text.pdf.PdfReader..ctor(Byte[] pdfIn)
These PDFs are all from different sources, uploaded in different manners at different times, yet none of them will open via iTextSharp.
This same process has worked for many other PDFs. And while it appears to be happening to most PDFs now, it's not 100% of them, because some still slip past this piece of code.
Is there something in my code that is causing this? Or do all of these seemingly random PDFs actually have the exact same issue? Or is there something else causing this? Any help is appreciated!
EDIT 1:
Here is the code with the exception; the last line is where the exception occurs:
public string ParseDocumentsFileContents(byte[] PdfFileData, Guid fileId, string fileName)
{
if (PdfFileData == null || PdfFileData.Count() <= 0)
{
return null;
}
iTextSharp.text.pdf.PdfReader.unethicalreading = true;
//read the text from each page, and put it all together
var PageContents = new System.Text.StringBuilder(PdfFileData.Count());
using (var engine = new Tesseract.TesseractEngine(tessdata_datapath, "eng", Tesseract.EngineMode.Default))
using (var reader = new iTextSharp.text.pdf.PdfReader(PdfFileData))
And here's an example of a PDF that resulted in this exception. (I realize there is no text in the PDF, but I had to find one that I could share.)
Here are some other examples of PDFs that all resulted in this exception:
Example #2, Example #3, Example #4
EDIT 2:
I upgraded to iTextSharp version 5.5.12 and have the same exceptions. But in both versions, the following PDFs (and others) do NOT result in this exception: Does Not Exception Here #1 and Does Not Exception Here #2

PdfTextExtractor.GetTextFromPage suddenly giving empty string

We've been using the iTextSharp libraries for a couple of years now within an SSIS process to read some values out of a set of PDF exam documents. Everything had been running nicely until this week, when we suddenly started getting an empty string back from calls to the PdfTextExtractor.GetTextFromPage method. I'll include the code here:
// Read the data from the blob column where the PDF exists
byte[] byteBuffer = Row.FileData.GetBlobData(0, (int)Row.FileData.Length);
using (var pdfReader = new PdfReader(byteBuffer))
{
// Here is the important stuff
var extractStrategy = new LocationTextExtractionStrategy();
// This call will extract the page with the proper data on it depending on the exam type
// 1-page exams = NBOME - need to read first page for exam result data
// 2-page exams = NBME - need to read second page for exam result data
// The next two statements utilize this construct.
var vendor = pdfReader.NumberOfPages == 1 ? "NBOME" : "NBME";
// *** THIS NEXT LINE GIVES THE EMPTY STRING
var newText = PdfTextExtractor.GetTextFromPage(pdfReader, pdfReader.NumberOfPages == 1 ? 1 : 2, extractStrategy);
var stringList = newText.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
var fileParser = FileParseFactory.GetFileParse(stringList, vendor);
// Populate our output variables
Row.ParsedExamName = fileParser.GetExamName(stringList);
Row.DateParsed = DateTime.Now;
Row.ParsedId = fileParser.GetStudentId(stringList);
Row.ParsedTestDate = fileParser.GetTestDate(stringList);
Row.ParsedTestDateString = fileParser.GetTestDateAsString(stringList);
Row.ParsedName = fileParser.GetStudentName(stringList);
Row.ParsedTotalScore = fileParser.GetTestScore(stringList);
Row.ParsedVendor = vendor;
}
This is not happening for all PDFs, by the way. To explain more: we are reading in exam files. One of the exam types (NBME) reads just fine. The other type (NBOME) does not, even though prior to this week the NBOME ones were being read fine.
This leads me to think it is an internal format change in the PDF files themselves.
Also, another bit of information: the pdfReader itself has data (I can get a byte[] array of the data), but the call to get any text simply gives me an empty string.
I'm sorry I'm not able to show any exam data or files - that information is sensitive.
Has anybody seen something like this? If so, any possible solutions?
Well, we have found our answer. The user originally downloaded the PDF exam result files from the NBOME web site to import into my parsing system. As I said, this worked for quite some time. This week, however, the user stopped downloading the files and instead used a print-to-PDF feature to print the PDFs as new PDFs. When she did that, the problem occurred.
Bottom line: printing the PDF as a PDF appears to have changed something under the covers that caused the iTextSharp read not to fail, but to return an empty string. She should have just continued downloading the files directly.
Thanks to those who offered some comments!

Some special characters are replaced by '?' while generating PDF from HTML

I am trying to generate a PDF from an HTML file using the iTextSharp library, but when I convert the HTML into PDF, some special characters in the HTML file are replaced by a '?' sign (e.g. €).
Here is my code:
var elements = XMLWorkerHelper.ParseToElementList(html, null);
foreach (var element in elements)
{
document.Add(element);
}
XMLWorkerHelper is a class in the iTextSharp library.
I just want the generated PDF to come out the same as my HTML file.
If you use XMLWorkerHelper.ParseToElementList(String, String) (which you are) then iTextSharp is going to ask the .Net runtime to figure out the contents of the file by calling System.Text.Encoding.Default.GetBytes().
Per the docs, System.Text.Encoding.Default
Gets an encoding for the operating system's current ANSI code page
And further (emphasis mine):
Different computers can use different encodings as the default, and the default encoding can even change on a single computer. Therefore, data streamed from one computer to another or even retrieved at different times on the same computer might be translated incorrectly. In addition, the encoding returned by the Default property uses best-fit fallback to map unsupported characters to characters supported by the code page. For these two reasons, using the default encoding is generally not recommended. To ensure that encoded bytes are decoded properly, you should use a Unicode encoding, such as UTF8Encoding or UnicodeEncoding, with a preamble. Another option is to use a higher-level protocol to ensure that the same format is used for encoding and decoding.
So from the above you'll see that in the absence of any information in the file about how the raw bytes are intended to be interpreted, .Net will just use the local code page to interpret them. What's really fun is if you move your code 100% exactly as-is to another machine you might get different results because that machine might have a different code page set.
The best solution is to avoid code pages completely. To do this, just save the file as Unicode compatible format such as UTF8 and include a BOM to explicitly declare your intentions. The BOM is optional (and frowned upon by some people) but it is also the most explicit way in the absence of other information (such as HTTP headers or post-it notes).
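Writing a file as UTF-8 with a BOM is a one-liner from .NET itself; a minimal sketch (the file path and HTML content are placeholders):

```csharp
using System;
using System.IO;
using System.Text;

class SaveUtf8WithBom
{
    static void Main()
    {
        // Hypothetical output path; adjust to your own file.
        string path = Path.Combine(Path.GetTempPath(), "input.html");

        // new UTF8Encoding(true) writes a byte order mark (EF BB BF) at the
        // start of the file, so readers can detect the encoding instead of
        // falling back to the machine's ANSI code page.
        var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
        File.WriteAllText(path, "<p>Price: €10</p>", utf8WithBom);

        // The first three bytes of the file are now the BOM.
        byte[] raw = File.ReadAllBytes(path);
        Console.WriteLine($"{raw[0]:X2} {raw[1]:X2} {raw[2]:X2}"); // EF BB BF
    }
}
```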
The second option is to just re-implement XMLWorkerHelper.ParseToElementList() with your appropriate encoding. SourceForge is apparently down right now so here's the body of that method:
/**
* Parses an HTML string and a string containing CSS into a list of Element objects.
* The FontProvider will be obtained from iText's FontFactory object.
*
* @param html a String containing an XHTML snippet
* @param css a String containing CSS
* @return an ElementList instance
*/
public static ElementList ParseToElementList(String html, String css) {
// CSS
ICSSResolver cssResolver = new StyleAttrCSSResolver();
if (css != null) {
ICssFile cssFile = XMLWorkerHelper.GetCSS(new MemoryStream(Encoding.Default.GetBytes(css)));
cssResolver.AddCss(cssFile);
}
// HTML
CssAppliers cssAppliers = new CssAppliersImpl(FontFactory.FontImp);
HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
htmlContext.AutoBookmark(false);
// Pipelines
ElementList elements = new ElementList();
ElementHandlerPipeline end = new ElementHandlerPipeline(elements, null);
HtmlPipeline htmlPipeline = new HtmlPipeline(htmlContext, end);
CssResolverPipeline cssPipeline = new CssResolverPipeline(cssResolver, htmlPipeline);
// XML Worker
XMLWorker worker = new XMLWorker(cssPipeline, true);
XMLParser p = new XMLParser(worker);
p.Parse(new MemoryStream(Encoding.Default.GetBytes(html)));
return elements;
}
The second to last line of code that starts p.Parse is what you'd want to change. Since we don't know what the bytes of your file are (and neither does your computer, apparently) we can't tell you what to switch the encoder over to.
Just to wrap up, this actually isn't an iTextSharp problem at all, this is actually the default behavior of the .Net runtime. iTextSharp is just using system default in the absence of information.
p.Parse(new StringReader(html));
This worked for me.
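For what it's worth, the '?' substitution itself is easy to reproduce with nothing but the framework's encoders: an encoding with no slot for '€' substitutes '?', while UTF-8 round-trips it. A small stdlib-only sketch, using ASCII as a stand-in for a limited code page:

```csharp
using System;
using System.Text;

class EncodingFallbackDemo
{
    static void Main()
    {
        string html = "<p>€ 99</p>";

        // ASCII has no '€', so the encoder's replacement fallback emits '?'.
        byte[] asciiBytes = Encoding.ASCII.GetBytes(html);
        Console.WriteLine(Encoding.ASCII.GetString(asciiBytes)); // <p>? 99</p>

        // UTF-8 round-trips every character unchanged.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(html);
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes)); // <p>€ 99</p>
    }
}
```

This is why the fix is to control the encoding (or hand the parser a StringReader, as above, so no byte decoding happens at all).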

Reading Byte[] into Database from API

I have been reading this as I would like to do this via LINQ. However, I have been unable to figure out how to read the data from the API.
When I output resource.Data.Body it says, Byte[].
When I output resource.Data.Size it says, 834234822. (or something like that)
I am trying to save the contents into my database like this:
newContent.ATTACHMENT = resource.Data.Body;
However, no data is ever loaded. I assume I have to loop through Body and store the contents in a variable, but I am not sure how.
Can someone help me connect the dots here?
Edit:
This is the source of the binary data I am trying to read http://dev.evernote.com/start/core/resources.php
Edit 2:
I am using the following code, which gives me binary data and saves it to the database, but it must be corrupt or something, because when I go to open the file, Windows Photo Viewer says it's corrupt or too large...
Resource resource = noteStore.getResource(authToken, attachment.Guid, true, false, true, true);
StringBuilder data = new StringBuilder();
foreach(byte b in resource.Data.Body)
{
data.Append(Convert.ToString(b, 2).PadLeft(8, '0'));
}
...
newContent.ATTACHMENT = System.Text.Encoding.ASCII.GetBytes(data.ToString());
Given that resource.Data.Body is byte[], and newContent.ATTACHMENT is System.Data.Linq.Binary, you should use the constructor on System.Data.Linq.Binary which takes an input parameter of type byte[]. http://msdn.microsoft.com/en-us/library/bb351422.aspx
newContent.ATTACHMENT = new System.Data.Linq.Binary(resource.Data.Body);

Need Alternative to EO.Pdf for Converting HTML to PDF in C#, wkhtmltopdf?

I am creating an HTML catalog of movies and then converting it to PDF. I was using EO.Pdf, and it worked for my small test sample. However, when I run it against the entire list of movies, the resulting HTML file is nearly 8000 lines and 7 MB. EO.Pdf times out when attempting to convert it. I believe it is a limitation of the free version, as I can copy the entire HTML, paste it into their online demo, and it works.
I am looking for an alternative to use. I am not good with the command line or running external programs, so I would prefer something I can add to the .NET library and use easily. I will admit that EO.Pdf was easy to use: once I added the dll to the library and added the namespace, it took one line of code to convert either the HTML code or the HTML file into a PDF. The downsides I ran into were that they put a stamp on every page (in 16pt font) with their website on it. It also wouldn't pick up half of my images; I'm not sure why. I used a relative URL in the HTML file to the images, and I created the PDF in the same dir as the HTML file.
I do not want to re-create the layout in a PDF, so I think something like iTextSharp is out. I've read a bit about something called wkhtmltopdf, or something strange like that. It sounded good, but it needs a wrapper, and I have no clue how to accomplish that or use it.
I would appreciate suggestions with basic instructions on how to use them: either a library and a couple of lines on how to use it, or if you can tell me how to set up/use wkhtmltopdf I would be extremely grateful!
Thanks in advance!
I'm using wkhtmltopdf and I'm very happy with it. One way of using it:
Process p = new Process();
p.StartInfo.CreateNoWindow = true;
p.StartInfo.UseShellExecute = false;
p.StartInfo.RedirectStandardOutput = true;
p.StartInfo.FileName = "wkhtmltopdf.exe";
p.StartInfo.Arguments = "-O landscape <<URL>> -";
p.Start();
and then you can get a stream:
p.StandardOutput.BaseStream
I used this because I needed a stream, of course you can invoke it differently.
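If you need the result as a byte[] rather than a stream, drain StandardOutput.BaseStream into a MemoryStream before calling WaitForExit, otherwise a large PDF can fill the pipe and deadlock the child process. A sketch wrapping the snippet above in a helper (the executable name and arguments are placeholders):

```csharp
using System.Diagnostics;
using System.IO;

public class WkhtmltopdfRunner
{
    // Runs an executable and returns everything it wrote to stdout as bytes.
    public static byte[] RunToBytes(string fileName, string arguments)
    {
        var p = new Process();
        p.StartInfo.CreateNoWindow = true;
        p.StartInfo.UseShellExecute = false;
        p.StartInfo.RedirectStandardOutput = true;
        p.StartInfo.FileName = fileName;
        p.StartInfo.Arguments = arguments;
        p.Start();

        using (var ms = new MemoryStream())
        {
            // Drain stdout BEFORE WaitForExit to avoid a full-pipe deadlock.
            p.StandardOutput.BaseStream.CopyTo(ms);
            p.WaitForExit();
            return ms.ToArray();
        }
    }
}
```

Usage would then be something like `byte[] pdf = WkhtmltopdfRunner.RunToBytes("wkhtmltopdf.exe", "-O landscape http://example.com -");` where the trailing `-` tells wkhtmltopdf to write the PDF to stdout.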
Here is also a discussion about invoking wkhtmltopdf
I just saw that someone is implementing a C# wrapper for wkhtmltopdf. I haven't tested it, but it may be worth a look.
After much searching I decided to use HiQPdf. It was simple to use, fast enough for my needs and the price point was acceptable to me.
var converter = new HiQPdf.HtmlToPdf();
converter.Document.PageSize = PdfPageSize.Letter;
converter.Document.PageOrientation = PdfPageOrientation.Portrait;
converter.Document.Margins = new PdfMargins(15); // Unit = Points
converter.ConvertHtmlToFile(htmlText, null, fileName);
It even includes a free version if you can keep it to 3 pages.
http://www.hiqpdf.com/free-html-to-pdf-converter.aspx
And no, I am in no way affiliated with them.
I recommend ExpertPdf.
ExpertPdf Html To Pdf Converter is very easy to use and it supports the latest html5/css3. You can either convert an entire url to pdf:
using ExpertPdf.HtmlToPdf;
byte[] pdfBytes = new PdfConverter().GetPdfBytesFromUrl(url);
or a html string:
using ExpertPdf.HtmlToPdf;
byte[] pdfBytes = new PdfConverter().GetPdfBytesFromHtmlString(html, baseUrl);
You also have the alternative to directly save the generated pdf document to a Stream or to a file on disk.
