Generate PDF from large html

Generate PDF from large html - c#

I am generating a PDF from an HTML string.
When this string is really long, I would like to create a new page, split the text (without breaking the html) and so on.
Here is my code :
// instantiate Pdf object
Aspose.Pdf.Generator.Pdf pdf = new Aspose.Pdf.Generator.Pdf();
// specify the Character encoding for for HTML file
pdf.HtmlInfo.CharSet = "UTF-8";
pdf.HtmlInfo.Margin.Left = 10;
pdf.HtmlInfo.Margin.Right = 10;
pdf.HtmlInfo.PageHeight = 1050;
pdf.HtmlInfo.PageWidth = 730;
pdf.HtmlInfo.ShowUnknownHtmlTagsAsText = true;
pdf.HtmlInfo.TryEnlargePredefinedTableColumnWidthsToAvoidWordBreaking = true;
pdf.HtmlInfo.CharsetApplyingLevelOfForce = Aspose.Pdf.Generator.HtmlInfo.CharsetApplyingForceLevel.UseWhenImpossibleDetectFromContent;
// bind the source HTML
pdf.BindHTML("MyVeryVeryLongHTML");
MemoryStream stream = new MemoryStream();
pdf.Save(stream);
byte[] pdfBytes = stream.ToArray();
This code works for the HTML, but the overflow is not handled. The text continue after the page. Is it possible to set a max "height" of the page to not cross, and if it does, it recreates a new page ?
Hope it makes sense !
Thanks a lot

You can set the Page height by selecting type of PDF page you require like A1, A2, etc . Afterwords , your problem of page height will automatically be taken care by the Aspose. For more refer the link..
Aspose PDF Page Height
Update
update pdf.HtmlInfo to pdf.PageSetup (or pdf.PageInfo) and add bottom margin also.

Related

Chinese Localization during PDF Generation on a Chinese server results in blocks being printed

We have an ASP.Net MVC application that we are localizing to multiple languages. All languages are rendered right on the UI, including Chinese. When generating a pdf, using ExpertPDF, the Chinese PDF is fine as long as the OS acting as the server is not in Chinese (We've tried English, French, Dutch, Polish, Spanish, German). However, upon changing the OS localization (winsystemlocale) to zh-CN, the PDF generation somehow breaks. BUT, only the PDF generation breaks, the UI is rendered just fine.
We don't do anything special to the actual content of the generated PDF when we process it with ExpertPDF, we literally just pass the html of the existing screen with some layout changes.
We've been scratching our heads with this one. Any help is appreciated. Thank you.
EDIT
Upon downloading both the broken pdf and a correct pdf, we found out that somehow the embedded fonts change depending on the winsystemlocale. It seems that the library can't find the SimSun font when the winsystemlocale is in zh-CN.
Further investigation also suggests that for some reason, .NET can no longer supply the correct fallback font to the PDF once the winsystemlocale is in zh-CN. We're aware that the display name for SimSun font changes but can still be initialize using the same text.
Here is the code we use to generate the PDF
docTitle = string.IsNullOrEmpty(docTitle) ? "DXT Document" : docTitle;
byte[] pdfBytes = null;
PdfConverter converter = CreateConverter();
converter.PdfDocumentInfo.Title = docTitle;
HtmlToPdfArea header = new HtmlToPdfArea(headerHtml, baseURL);
HtmlToPdfArea footer = new HtmlToPdfArea(footerHtml, baseURL);
header.AddPadding = true;
footer.AddPadding = true;
header.FitHeight = true;
header.FitWidth = true;
footer.FitHeight = true;
footer.FitWidth = true;
converter.PdfHeaderOptions.AddHtmlToPdfArea(header);
converter.PdfFooterOptions.AddHtmlToPdfArea(footer);
converter.PdfDocumentOptions.PdfPageOrientation = isLandscape ? PDFPageOrientation.Landscape : PDFPageOrientation.Portrait;
converter.PdfDocumentOptions.FitWidth = true;
converter.PdfDocumentOptions.EmbedFonts = true;
converter.PageWidth = virtualViewWidth;
converter.PdfHeaderOptions.HeaderHeight = headerHeight;
var translated = LocalizationService.GetTranslation<Resources.Apheresis.ProcedureRecords>("Of");
converter.PdfFooterOptions.TextArea = new TextArea(pageNumberXPos, pageNumberYPos, $"&p; {translated} &P;",
new System.Drawing.Font(new System.Drawing.FontFamily("Arial"), 8, System.Drawing.GraphicsUnit.Point));
Document doc = converter.GetPdfDocumentObjectFromHtmlString(contentHtml, baseURL);
doc.OpenAction.Action = new ExpertPdf.HtmlToPdf.PdfDocument.PdfActionJavaScript("print();");
pdfBytes = doc.Save();
Here is the screen
Here is the generated PDF

Getting Paper Size of a PDF

I am using PDFsharp to read / write PDF in our application. I am trying to get the paper size of a page to show in metadata window. I am using following code to do the same:
// read PDF document into memory
      var pdfDocument = PdfReader.Open(pdfPath);
      // if PDF document has more than one page, then try to get the page size from first page
      if (pdfDocument.Pages.Count > 0)
      {
        var pageSize = pdfDocument.Pages[0].Size.ToString();
      } // if
But every time it returns 'Undefined', I even tried to create A4 page document for MS Word. But still it returns 'Undefined'.
I also created a PDF from HTML but then also page size comes as undefined.
static void Main(string[] args)
{
// create PDF config to generate PDF
var config = new PdfGenerateConfig()
{
PageSize = PageSize.Letter,
PageOrientation = PageOrientation.Landscape
};
// load the HTML file
var html = System.IO.File.ReadAllText(#"C:\Users\mohit\Desktop\Temp\Ekta\Test.html");
// generate the PDF
var pdf = PdfGenerator.GeneratePdf(html,
PageSize.Letter);
// get the paper size of first page
var size = pdf.Pages[0].Size;
// save the pdf document
pdf.Save(#"C:\Users\mohit\Desktop\Temp\Ekta\A13.pdf");
Console.ReadKey();
}

The properties you have to use are Width and Height. Convert them to millimeter or inch if you do not want to show the size in points.
The property Size is a convenient way to set standard page sizes for new pages as you can specify A4 or Letter. It will not be set for imported PDF files.

Edit Total Number of Pages in Footer SelectPDF

I am converting html page to pdf using HtmlToPdf() of SelectPDF. Since html content is big, I am breaking it in half and creating 2 PDFs.
I am struggling to edit the total_pages in the footer to display actual total number of the pages, not only the current document; as well as page_number to display the actual page number in the context of both PDFs.
How can I assess {page_number} and {total_pages} to calculate proper values? All examples I found use PdfDocument(), not HtmlToPdf().
Dim converter As New HtmlToPdf()
Dim text As New PdfTextSection(0, 10, "Page: {page_number} of {total_pages} ")
text.HorizontalAlign = PdfTextHorizontalAlign.Center
converter.Footer.Add(text)
I am tagging both C# and VB since SelectPDF is for both languages, and relevant sample from either one will work for me. Thank you

Today I've stumbled upon the same issue and I have found a work-around for the problem. The converter was able to show page numbers for it's the generated document but can't be aware of multiple generated files (you can't access the page properties) so all my pages I concatenated were showing Page 1 of 1.
First I define one PdfDocument (see it as the main document) and I use HtmlToPdf to append html converted files to this main document.
// Create converter
converter = new HtmlToPdf();
PdfTextSection text = new PdfTextSection(0, 10, "Page: {page_number} of {total_pages} ", new Font("Arial", 8));
text.HorizontalAlign = PdfTextHorizontalAlign.Right;
converter.Footer.Add(text);
// Create main document
pdfDocument = new PdfDocument();
Then I add pages (from html) using this method
public void AddPage(string htmlPage)
{
PdfDocument doc = converter.ConvertHtmlString(htmlPage);
pdfDocument.Append(doc);
converter.Footer.TotalPagesOffset += doc.Pages.Count;
converter.Footer.FirstPageNumber += doc.Pages.Count;
}
This results in correct page numbers for the main document. The same trick could be used for splitting files and page numbers over multiple documents like you described.
EDIT: In case you don't see any page numbering using the HtmlToPdf converter, don't forget to set following property:
converter.Options.DisplayFooter = true;

There is an open source library called itextsharp that will help get total page count.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.xml;
namespace GetPages_PDF
{
class Program
{
static void Main(string[] args)
{
// Right side of equation is location of YOUR pdf file
string ppath = "C:\\aworking\\Hawkins.pdf";
PdfReader pdfReader = new PdfReader(ppath);
int numberOfPages = pdfReader.NumberOfPages;
Console.WriteLine(numberOfPages);
Console.ReadLine();
}
}
}
Then you can stamp text also on the page but you will need to add the location to where it needs to go.
link: http://crmhunt.com/how-to-modify-pdf-file-using-itextsharp/
hope this helps in some way.

You should use the following properties:
FirstPageNumber - Controls the page number for the first page being
rendered.
TotalPagesOffset - Controls the total number of pages
offset in the generated pdf document.
More details here:
http://selectpdf.com/html-to-pdf/docs/html/HtmlToPdfHeadersAndFooters.htm

The answers above did not work for me as I was trying to merge multiple PDFs with different orientations. bonnoj's answer did add page numbers but they were incorrect and I couldn't find a way to correct them. So I took a different approach - I created a PDF, then for each HTML page I added a pdfPage and then added a PdfHtmlElement to that page. Finally I loop over the pages and add a custom footer to each page. This may not be the most efficient way to do this but it's the only way that I could find that added the footer in the correct place when mixing portrait and landscape pages. Hopefully it will save somebody else spending hours playing with different properties.
var pdfDocument = new PdfDocument(PdfStandard.Full);
foreach (var (html, pdfPageOrientation) in pages)
{
var page = pdfDocument.AddPage(PdfCustomPageSize.A4, new PdfMargins(marginLeft, marginRight, marginTop, marginBottom));
page.Orientation = pdfPageOrientation;
var pdfHtmlElement = new PdfHtmlElement(html, "");
page.Add(pdfHtmlElement);
}
var pdfFont = pdfDocument.AddFont(PdfStandardFont.Helvetica);
pdfFont.Size = 12;
foreach (PdfPage page in pdfDocument.Pages)
{
var customFooter = pdfDocument.AddTemplate(page.PageSize.Width, 30);
var pdfFooterTextElement = new PdfTextElement(0, 15,
pageFooterText,
pdfFont)
{
HorizontalAlign = PdfTextHorizontalAlign.Right,
VerticalAlign = PdfTextVerticalAlign.Bottom,
};
customFooter.Add(pdfFooterTextElement);
page.CustomFooter = customFooter;
}
pdfDocument.Save(stream);

Create Multi-page Index File(TOC) for merged pdf using itext library in java

How can I write a multi-page ToC to the end of a PDF consisting of merged documents, using iTextSharp?
The answer to Create Index File(TOC) for merged pdf using itext library in java explains how to create a ToC page when merging PDFs (catalogued in the iTextSharp book http://developers.itextpdf.com/examples/merging-pdf-documents/merging-documents-and-create-table-contents#795-mergewithtoc.java). Code in this answer is based on those examples.
However it only works if the ToC is 1 page long. If the content becomes longer, then it repeats itself on the same page rather than spanning into the next page.
Trying to add the link directly to the text via:
ct.Add(new Chunk("link").SetLocalGoto("p1"))
causes an exception ("Cannot add Annotations, not enough pages in document").
Can anyone explain a method that will allow me to append multiple pages of content to a PDF when merging them (the more general the approach, the better). Is there a way to write into the document using Document.Add() instead of having to copy in template pages and write on the top of them?
(Note, code is in c#)

This answer is based on the example from the iTextSharp documentation, but converted to C#.
To make the added text span multiple pages, I found I could use ColumnText.HasMoreText(ct.Go()) to tell me if there was more text than could fit on the current page. You can then save the current page, re-create a new page template, and move the columntext to the new page. Below this is in a function called CheckForNewPage:
private bool CheckForNewPage(PdfCopy copy, ref PdfImportedPage page, ref PdfCopy.PageStamp stamp, ref PdfReader templateReader, ColumnText ct)
{
if (ColumnText.HasMoreText(ct.Go()))
{
//Write current page
stamp.AlterContents();
copy.AddPage(page);
//Start a new page
ct.SetSimpleColumn(36, 36, 559, 778);
templateReader = new PdfReader("template.pdf");
page = copy.GetImportedPage(templateReader, 1);
stamp = copy.CreatePageStamp(page);
ct.Canvas = stamp.GetOverContent();
ct.Go();
return true;
}
return false;
}
This should be called each time text is added to the ct variable.
If CheckForNewPage returns true you can then increment the page count, and reset the y variable to the top of the new page so that link annotation is in the correct place on the new page.
e.g.
var tocPageCount = 0;
var para = new iTextSharp.text.Paragraph(documentName);
ct.AddElement(para);
ct.Go();
if (CheckForNewPage(context, copy, ref page, ref stamp, ref tocReader, ct))
{
tocPageCount++;
y = 778;
}
//Add link annotation
action = PdfAction.GotoLocalPage(d.DocumentID.ToString(), false);
link = new PdfAnnotation(copy, TOC_Page.Left, ct.YLine, TOC_Page.Right, y, action);
stamp.AddAnnotation(link);
y = ct.YLine;
This creates the pages correctly. The below code adapts the end of ToC2 example for re-ordering the pages, in order to handle more than 1 page.
var rdr = new PdfReader(baos.toByteArray());
var totalPageCount = rdr.NumberOfPages;
rdr.SelectPages(String.Format("{0}-{1}, 1-{2}", totalPageCount - tocPageCount +1, totalPageCount, totalPageCount - tocPageCount));
PdfStamper stamper = new PdfStamper(rdr, new FileStream(outputFilePath, FileMode.Create));
stamper.Close();
By re-using the CheckForNewPage function, you should be able to add any content to new pages you create, and have it span multiple pages. If you don't need the annnotations you call CheckForNewPage in a loop at the end of adding all your content (just don't call ct.Go() beforehand).

iTextSharp can't read some PDF files

I have a problem to read and display content of some PDFs into RichTextBox.
I use the following code:
string fileName = #"C:\Users\PC\Desktop\SomePdf.pdf";
string str = string.Empty;
PdfReader reader = new PdfReader(fileName);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
str = str + s;
rtbVsebina.Text = str;
}
reader.Close();
Some PDFs can be read and displayed into RichTextBox and some they can not be. For those that can not be read I only get empty RichTextBox but with some added lines as I would press Key 'Enter' on the keyboard a couple of times.
Does anybody know what could be wrong?

You are confusing page content with page annotations.
Page content is part of the content stream of a page. It's referred to in the /Contents entry of the page dictionary and (optionally) in external objects (aka XObjects). With the code snippet you have copy/pasted in your question, you are extracting this content.
A rich text box is one of the many types of annotations. Annotations are not part of the content stream of a page. They are referred to from the /Annots entry of the page dictionary. If you want to get the contents of an annotation, you need to ask the page for its annotations instead of parsing the content of the page. See for instance Reading PDF Annotations with iText.
In answer to your question "What am I doing wrong": you were looking at the wrong place.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Generate PDF from large html - c#

Related

Chinese Localization during PDF Generation on a Chinese server results in blocks being printed

Getting Paper Size of a PDF

Edit Total Number of Pages in Footer SelectPDF

Create Multi-page Index File(TOC) for merged pdf using itext library in java

iTextSharp can't read some PDF files

Categories

Resources