Related
I am trying to Generate a Single PDF File From Multiple Memory Streams, I am having a lot of trouble determining the proper way to merge 2 PDF MemoryStreams into one PDF MemoryStream that contains all the pages from both source PDF MemoryStreams. It seems simple and I think the code below is set up properly but the resulting PDF memory stream does not contain both the Files Combined.
I am having a lot of trouble determining the proper way to merge 2 PDF MemoryStreams into one PDF MemoryStream that contains all the pages from both source PDF MemoryStreams. It seems simple and I think the code below is set up properly but the resulting PDF memory stream does not Contain Merged Documents.
I have found multiple ways documented on the Internet as the "proper" way to do the merge. The actual sample code with iText 7 seems to be unusually complex (in that is mixes multiple concepts into one sample repeatedly - as in doesn't reduce the concept to the simplest possible code), and seems to fail to demonstrate simple concepts. For instance, their PDFMerge documentation has no sample code at all in the documentation (nor does anything else I looked at in the class documentation). The examples they have online actually always mix merging from files (not MemoryStreams or byte[]) with other concepts like adding page numbers or adding Table of Contents. So they never just show one concept and they never start with anything other than files. My PDFs are coming out of a database and we just need to merge them into one PDF memory stream and save it back out. My concern is that maybe I am not creating the MemoryStream properly when I initialize the PDFWriter. As none of their samples ever do anything but initial with an actual file, I was unable to confirm this was done properly. I also fully qualified all objects in the code because I want to leave the old iTextSharp code in place while I am upgrading to the new iText 7. This was done to make sure an iTextSharp object of the same name wasn't inadvertently being unknowingly used.
Also, in the interest of making the source as easy as possible to read I removed some of the declarations and initialization of objects being used. Everything was traced through and all values are fully loaded with proper values as you trace through the code. I am assuming the problem is that I didn't prepare the PDF objects properly or that I have to do something special with the PDFWriter on the Destination PDF Document (ms) before the the PDFMerge object.
List<byte[]> streams = new List<byte[]>();
somelist.ForEach(item=>
{
using (var workStream = new MemoryStream())
using (var pdfWriter = new PdfWriter(workStream))
{
pdfWriter.SetCloseStream(false);
HtmlConverter.ConvertToPdf(strContent, pdfWriter);
streams.Add(workStream.ToArray());
pdfWriter.Close();
}
}
MemoryStream ms = new MemoryStream();
PdfWriter writer = new PdfWriter(ms);
PdfDocument document = new PdfDocument(writer);
PdfMerger merger = new PdfMerger(document);
streams.ForEach(stream =>
{
Stream msDoc = new MemoryStream(stream);
PdfDocument doc = new PdfDocument(new PdfReader(msDoc));
merger.Merge(doc, 1, doc.GetNumberOfPages());
doc.Close();
});
ByteContent = ms.ToArray();
document.Close();
Merging is a really straightforward process:
var SourceDocument1 = new PdfDocument(new PdfReader(SRC));
var SourceDocument2 = new PdfDocument(new PdfReader(SRC1));
byte[] result;
using (var memoryStream = new MemoryStream())
{
var pdfWriter = new PdfWriter(memoryStream);
var pdfDocument = new PdfDocument(pdfWriter);
PdfMerger merge = new PdfMerger(pdfDocument);
merge.Merge(SourceDocument1, 1, SourceDocument1.GetNumberOfPages())
.Merge(SourceDocument2, 1, SourceDocument2.GetNumberOfPages());
merge.Close();
result = memoryStream.ToArray();
}
File.WriteAllBytes(#"C:\temp\file.pdf", result);
this will merge SRC with SRC1.
There are a lot of examples on Github, such as this one (there is also a whole folder with merge examples).
I am writing the destination document in the end, just to make sure it's being created correctly, but you can do whatever you want to with the MemoryStream, of course.
I am posting this question because many developers ask more or less the same question in different forms. I will answer this question myself (I am the Founder/CTO of iText Group), so that it can be a "Wiki-answer." If the Stack Overflow "documentation" feature still existed, this would have been a good candidate for a documentation topic.
The source file:
I am trying to convert the following HTML file to PDF:
<html>
<head>
<title>Colossal (movie)</title>
<style>
.poster { width: 120px;float: right; }
.director { font-style: italic; }
.description { font-family: serif; }
.imdb { font-size: 0.8em; }
a { color: red; }
</style>
</head>
<body>
<img src="img/colossal.jpg" class="poster" />
<h1>Colossal (2016)</h1>
<div class="director">Directed by Nacho Vigalondo</div>
<div class="description">Gloria is an out-of-work party girl
forced to leave her life in New York City, and move back home.
When reports surface that a giant creature is destroying Seoul,
she gradually comes to the realization that she is somehow connected
to this phenomenon.
</div>
<div class="imdb">Read more about this movie on
IMDB
</div>
</body>
</html>
In a browser, this HTML looks like this:
The problems I encountered:
HTMLWorker doesn't take CSS into account at all
When I used HTMLWorker, I need to create an ImageProvider to avoid an error that informs me that the image can't be found. I also need to create a StyleSheet instance to change some of the styles:
public static class MyImageFactory implements ImageProvider {
public Image getImage(String src, Map<String, String> h,
ChainedProperties cprops, DocListener doc) {
try {
return Image.getInstance(
String.format("resources/html/img/%s",
src.substring(src.lastIndexOf("/") + 1)));
} catch (DocumentException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
}
public static void main(String[] args) throws IOException, DocumentException {
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream("results/htmlworker.pdf"));
document.open();
StyleSheet styles = new StyleSheet();
styles.loadStyle("imdb", "size", "-3");
HTMLWorker htmlWorker = new HTMLWorker(document, null, styles);
HashMap<String,Object> providers = new HashMap<String, Object>();
providers.put(HTMLWorker.IMG_PROVIDER, new MyImageFactory());
htmlWorker.setProviders(providers);
htmlWorker.parse(new FileReader("resources/html/sample.html"));
document.close();
}
The result looks like this:
For some reason, HTMLWorker also shows the content of the <title> tag. I don't know how to avoid this. The CSS in the header isn't parsed at all, I have to define all the styles in my code, using the StyleSheet object.
When I look at my code, I see that plenty of objects and methods I'm using are deprecated:
So I decided to upgrade to using XML Worker.
Images aren't found when using XML Worker
I tried the following code:
public static final String DEST = "results/xmlworker1.pdf";
public static final String HTML = "resources/html/sample.html";
public void createPdf(String file) throws IOException, DocumentException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
document.open();
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
new FileInputStream(HTML));
document.close();
}
This resulted in the following PDF:
Instead of Times-Roman, the default font Helvetica is used; this is typical for iText (I should have defined a font explicitly in my HTML). Otherwise, the CSS seems to be respected, but the image is missing, and I didn't get an error message.
With HTMLWorker, an exception was thrown, and I was able to fix the problem by introducing an ImageProvider. Let's see if this works for XML Worker.
Not all CSS styles are supported in XML Worker
I adapted my code like this:
public static final String DEST = "results/xmlworker2.pdf";
public static final String HTML = "resources/html/sample.html";
public static final String IMG_PATH = "resources/html/";
public void createPdf(String file) throws IOException, DocumentException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
document.open();
CSSResolver cssResolver =
XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
htmlContext.setImageProvider(new AbstractImageProvider() {
public String getImageRootPath() {
return IMG_PATH;
}
});
PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
XMLWorker worker = new XMLWorker(css, true);
XMLParser p = new XMLParser(worker);
p.parse(new FileInputStream(HTML));
document.close();
}
My code is much longer, but now the image is rendered:
The image is larger than when I rendered it using HTMLWorker which tells me that the CSS attribute width for the poster class is taken into account, but the float attribute is ignored. How do I fix this?
The remaining question:
So the question boils down to this: I have a specific HTML file that I try to convert to PDF. I have gone through a lot of work, fixing one problem after the other, but there is one specific problem that I can't solve: how do I make iText respect CSS that defines the position of an element, such as float: right?
Additional question:
When my HTML contains form elements (such as <input>), those form elements are ignored.
Why your code doesn't work
As explained in the introduction of the HTML to PDF tutorial, HTMLWorker has been deprecated many years ago. It wasn't intended to convert complete HTML pages. It doesn't know that an HTML page has a <head> and a <body> section; it just parses all the content. It was meant to parse small HTML snippets, and you could define styles using the StyleSheet class; real CSS wasn't supported.
Then came XML Worker. XML Worker was meant as a generic framework to parse XML. As a proof of concept, we decided to write some XHTML to PDF functionality, but we didn't support all of the HTML tags. For instance: forms weren't supported at all, and it was very hard to support CSS that is used to position content. Forms in HTML are very different from forms in PDF. There was also a mismatch between the iText architecture and the architecture of HTML + CSS. Gradually, we extended XML Worker, mostly based on requests from customers, but XML Worker became a monster with many tentacles.
Eventually, we decided to rewrite iText from scratch, with the requirements for HTML + CSS conversion in mind. This resulted in iText 7. On top of iText 7, we created several add-ons, the most important one in this context being pdfHTML.
How to solve the problem
Using the latest version of iText (iText 7.1.0 + pdfHTML 2.0.0) the code to convert the HTML from the question to PDF is reduced to this snippet:
public static final String SRC = "src/main/resources/html/sample.html";
public static final String DEST = "target/results/sample.pdf";
public void createPdf(String src, String dest) throws IOException {
HtmlConverter.convertToPdf(new File(src), new File(dest));
}
The result looks like this:
As you can see, this is pretty much the result you'd expect. Since iText 7.1.0 / pdfHTML 2.0.0, the default font is Times-Roman. The CSS is being respected: the image is now floating on the right.
Some additional thoughts.
Developers often feel opposed to upgrade to a newer iText version when I give the advice to upgrade to iText 7 / pdfHTML 2. Allow me to answer to the top 3 of arguments I hear:
I need to use the free iText, and iText 7 isn't free / the pdfHTML add-on is closed source.
iText 7 is released using the AGPL, just like iText 5 and XML Worker. The AGPL allows free use in the sense of free of charge in the context of open source projects. If you are distributing a closed source / proprietary product (e.g. you use iText in a SaaS context), you can't use iText for free; in that case, you have to purchase a commercial license. This was already true for iText 5; this is still true for iText 7. As for versions prior to iText 5: you shouldn't use these at all. Regarding pdfHTML: the first versions were indeed only available as closed source software. We have had heavy discussion within iText Group: on the one hand, there were the people who wanted to avoid the massive abuse by companies who don't listen to their developers when those developers tell the powers that be that open source isn't the same as free. Developers were telling us that their boss forced them to do the wrong thing, and that they couldn't convince their boss to purchase a commercial license. On the other hand, there were the people who argued that we shouldn't punish developers for the wrong behavior of their bosses. Eventually, the people in favor of open sourcing pdfHTML, that is: the developers at iText, won the argument. Please prove that they weren't wrong, and use iText correctly: respect the AGPL if you're using iText for free; make sure that your boss purchases a commercial license if you're using iText in a closed source context.
I need to maintain a legacy system, and I have to use an old iText version.
Seriously? Maintenance also involves applying upgrades and migrating to new versions of the software you're using. As you can see, the code needed when using iText 7 and pdfHTML is very simple, and less error-prone than the code needed before. A migration project shouldn't take too long.
I've only just started and I didn't know about iText 7; I only found out after I finished my project.
That's why I'm posting this question and answer. Think of yourself as an eXtreme Programmer. Throw away all of your code, and start anew. You'll notice that it's not as much work as you imagined, and you'll sleep better knowing that you've made your project future-proof because iText 5 is being phased out. We still offer support to paying customers, but eventually, we'll stop supporting iText 5 altogether.
Use iText 7 and this code:
public void generatePDF(String htmlFile) {
try {
//HTML String
String htmlString = htmlFile;
//Setting destination
FileOutputStream fileOutputStream = new FileOutputStream(new File(dirPath + "/USER-16-PF-Report.pdf"));
PdfWriter pdfWriter = new PdfWriter(fileOutputStream);
ConverterProperties converterProperties = new ConverterProperties();
PdfDocument pdfDocument = new PdfDocument(pdfWriter);
//For setting the PAGE SIZE
pdfDocument.setDefaultPageSize(new PageSize(PageSize.A3));
Document document = HtmlConverter.convertToDocument(htmlFile, pdfDocument, converterProperties);
document.close();
}
catch (Exception e) {
e.printStackTrace();
}
}
Convert a static HTML page take also any CSS Style:
HtmlConverter.convertToPdf(new File("./pdf-input.html"),new File("demo-html.pdf"));
For spring Boot user: Convert a dynamic HTML page using SpringBoot and Thymeleaf:
#RequestMapping(path = "/pdf")
public ResponseEntity<?> getPDF(HttpServletRequest request, HttpServletResponse response) throws IOException {
/* Do Business Logic*/
Order order = OrderHelper.getOrder();
/* Create HTML using Thymeleaf template Engine */
WebContext context = new WebContext(request, response, servletContext);
context.setVariable("orderEntry", order);
String orderHtml = templateEngine.process("order", context);
/* Setup Source and target I/O streams */
ByteArrayOutputStream target = new ByteArrayOutputStream();
ConverterProperties converterProperties = new ConverterProperties();
converterProperties.setBaseUri("http://localhost:8080");
/* Call convert method */
HtmlConverter.convertToPdf(orderHtml, target, converterProperties);
/* extract output as bytes */
byte[] bytes = target.toByteArray();
/* Send the response as downloadable PDF */
return ResponseEntity.ok()
.header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=order.pdf")
.contentType(MediaType.APPLICATION_PDF)
.body(bytes);
}
I am posting this question because many developers ask more or less the same question in different forms. I will answer this question myself (I am the Founder/CTO of iText Group), so that it can be a "Wiki-answer." If the Stack Overflow "documentation" feature still existed, this would have been a good candidate for a documentation topic.
The source file:
I am trying to convert the following HTML file to PDF:
<html>
<head>
<title>Colossal (movie)</title>
<style>
.poster { width: 120px;float: right; }
.director { font-style: italic; }
.description { font-family: serif; }
.imdb { font-size: 0.8em; }
a { color: red; }
</style>
</head>
<body>
<img src="img/colossal.jpg" class="poster" />
<h1>Colossal (2016)</h1>
<div class="director">Directed by Nacho Vigalondo</div>
<div class="description">Gloria is an out-of-work party girl
forced to leave her life in New York City, and move back home.
When reports surface that a giant creature is destroying Seoul,
she gradually comes to the realization that she is somehow connected
to this phenomenon.
</div>
<div class="imdb">Read more about this movie on
IMDB
</div>
</body>
</html>
In a browser, this HTML looks like this:
The problems I encountered:
HTMLWorker doesn't take CSS into account at all
When I used HTMLWorker, I need to create an ImageProvider to avoid an error that informs me that the image can't be found. I also need to create a StyleSheet instance to change some of the styles:
public static class MyImageFactory implements ImageProvider {
public Image getImage(String src, Map<String, String> h,
ChainedProperties cprops, DocListener doc) {
try {
return Image.getInstance(
String.format("resources/html/img/%s",
src.substring(src.lastIndexOf("/") + 1)));
} catch (DocumentException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
}
public static void main(String[] args) throws IOException, DocumentException {
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream("results/htmlworker.pdf"));
document.open();
StyleSheet styles = new StyleSheet();
styles.loadStyle("imdb", "size", "-3");
HTMLWorker htmlWorker = new HTMLWorker(document, null, styles);
HashMap<String,Object> providers = new HashMap<String, Object>();
providers.put(HTMLWorker.IMG_PROVIDER, new MyImageFactory());
htmlWorker.setProviders(providers);
htmlWorker.parse(new FileReader("resources/html/sample.html"));
document.close();
}
The result looks like this:
For some reason, HTMLWorker also shows the content of the <title> tag. I don't know how to avoid this. The CSS in the header isn't parsed at all, I have to define all the styles in my code, using the StyleSheet object.
When I look at my code, I see that plenty of objects and methods I'm using are deprecated:
So I decided to upgrade to using XML Worker.
Images aren't found when using XML Worker
I tried the following code:
public static final String DEST = "results/xmlworker1.pdf";
public static final String HTML = "resources/html/sample.html";
public void createPdf(String file) throws IOException, DocumentException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
document.open();
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
new FileInputStream(HTML));
document.close();
}
This resulted in the following PDF:
Instead of Times-Roman, the default font Helvetica is used; this is typical for iText (I should have defined a font explicitly in my HTML). Otherwise, the CSS seems to be respected, but the image is missing, and I didn't get an error message.
With HTMLWorker, an exception was thrown, and I was able to fix the problem by introducing an ImageProvider. Let's see if this works for XML Worker.
Not all CSS styles are supported in XML Worker
I adapted my code like this:
public static final String DEST = "results/xmlworker2.pdf";
public static final String HTML = "resources/html/sample.html";
public static final String IMG_PATH = "resources/html/";
public void createPdf(String file) throws IOException, DocumentException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
document.open();
CSSResolver cssResolver =
XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
htmlContext.setImageProvider(new AbstractImageProvider() {
public String getImageRootPath() {
return IMG_PATH;
}
});
PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
XMLWorker worker = new XMLWorker(css, true);
XMLParser p = new XMLParser(worker);
p.parse(new FileInputStream(HTML));
document.close();
}
My code is much longer, but now the image is rendered:
The image is larger than when I rendered it using HTMLWorker which tells me that the CSS attribute width for the poster class is taken into account, but the float attribute is ignored. How do I fix this?
The remaining question:
So the question boils down to this: I have a specific HTML file that I try to convert to PDF. I have gone through a lot of work, fixing one problem after the other, but there is one specific problem that I can't solve: how do I make iText respect CSS that defines the position of an element, such as float: right?
Additional question:
When my HTML contains form elements (such as <input>), those form elements are ignored.
Why your code doesn't work
As explained in the introduction of the HTML to PDF tutorial, HTMLWorker has been deprecated many years ago. It wasn't intended to convert complete HTML pages. It doesn't know that an HTML page has a <head> and a <body> section; it just parses all the content. It was meant to parse small HTML snippets, and you could define styles using the StyleSheet class; real CSS wasn't supported.
Then came XML Worker. XML Worker was meant as a generic framework to parse XML. As a proof of concept, we decided to write some XHTML to PDF functionality, but we didn't support all of the HTML tags. For instance: forms weren't supported at all, and it was very hard to support CSS that is used to position content. Forms in HTML are very different from forms in PDF. There was also a mismatch between the iText architecture and the architecture of HTML + CSS. Gradually, we extended XML Worker, mostly based on requests from customers, but XML Worker became a monster with many tentacles.
Eventually, we decided to rewrite iText from scratch, with the requirements for HTML + CSS conversion in mind. This resulted in iText 7. On top of iText 7, we created several add-ons, the most important one in this context being pdfHTML.
How to solve the problem
Using the latest version of iText (iText 7.1.0 + pdfHTML 2.0.0) the code to convert the HTML from the question to PDF is reduced to this snippet:
public static final String SRC = "src/main/resources/html/sample.html";
public static final String DEST = "target/results/sample.pdf";
public void createPdf(String src, String dest) throws IOException {
HtmlConverter.convertToPdf(new File(src), new File(dest));
}
The result looks like this:
As you can see, this is pretty much the result you'd expect. Since iText 7.1.0 / pdfHTML 2.0.0, the default font is Times-Roman. The CSS is being respected: the image is now floating on the right.
Some additional thoughts.
Developers often feel opposed to upgrade to a newer iText version when I give the advice to upgrade to iText 7 / pdfHTML 2. Allow me to answer to the top 3 of arguments I hear:
I need to use the free iText, and iText 7 isn't free / the pdfHTML add-on is closed source.
iText 7 is released using the AGPL, just like iText 5 and XML Worker. The AGPL allows free use in the sense of free of charge in the context of open source projects. If you are distributing a closed source / proprietary product (e.g. you use iText in a SaaS context), you can't use iText for free; in that case, you have to purchase a commercial license. This was already true for iText 5; this is still true for iText 7. As for versions prior to iText 5: you shouldn't use these at all. Regarding pdfHTML: the first versions were indeed only available as closed source software. We have had heavy discussion within iText Group: on the one hand, there were the people who wanted to avoid the massive abuse by companies who don't listen to their developers when those developers tell the powers that be that open source isn't the same as free. Developers were telling us that their boss forced them to do the wrong thing, and that they couldn't convince their boss to purchase a commercial license. On the other hand, there were the people who argued that we shouldn't punish developers for the wrong behavior of their bosses. Eventually, the people in favor of open sourcing pdfHTML, that is: the developers at iText, won the argument. Please prove that they weren't wrong, and use iText correctly: respect the AGPL if you're using iText for free; make sure that your boss purchases a commercial license if you're using iText in a closed source context.
I need to maintain a legacy system, and I have to use an old iText version.
Seriously? Maintenance also involves applying upgrades and migrating to new versions of the software you're using. As you can see, the code needed when using iText 7 and pdfHTML is very simple, and less error-prone than the code needed before. A migration project shouldn't take too long.
I've only just started and I didn't know about iText 7; I only found out after I finished my project.
That's why I'm posting this question and answer. Think of yourself as an eXtreme Programmer. Throw away all of your code, and start anew. You'll notice that it's not as much work as you imagined, and you'll sleep better knowing that you've made your project future-proof because iText 5 is being phased out. We still offer support to paying customers, but eventually, we'll stop supporting iText 5 altogether.
Use iText 7 and this code:
public void generatePDF(String htmlFile) {
try {
//HTML String
String htmlString = htmlFile;
//Setting destination
FileOutputStream fileOutputStream = new FileOutputStream(new File(dirPath + "/USER-16-PF-Report.pdf"));
PdfWriter pdfWriter = new PdfWriter(fileOutputStream);
ConverterProperties converterProperties = new ConverterProperties();
PdfDocument pdfDocument = new PdfDocument(pdfWriter);
//For setting the PAGE SIZE
pdfDocument.setDefaultPageSize(new PageSize(PageSize.A3));
Document document = HtmlConverter.convertToDocument(htmlFile, pdfDocument, converterProperties);
document.close();
}
catch (Exception e) {
e.printStackTrace();
}
}
Convert a static HTML page take also any CSS Style:
HtmlConverter.convertToPdf(new File("./pdf-input.html"),new File("demo-html.pdf"));
For spring Boot user: Convert a dynamic HTML page using SpringBoot and Thymeleaf:
#RequestMapping(path = "/pdf")
public ResponseEntity<?> getPDF(HttpServletRequest request, HttpServletResponse response) throws IOException {
/* Do Business Logic*/
Order order = OrderHelper.getOrder();
/* Create HTML using Thymeleaf template Engine */
WebContext context = new WebContext(request, response, servletContext);
context.setVariable("orderEntry", order);
String orderHtml = templateEngine.process("order", context);
/* Setup Source and target I/O streams */
ByteArrayOutputStream target = new ByteArrayOutputStream();
ConverterProperties converterProperties = new ConverterProperties();
converterProperties.setBaseUri("http://localhost:8080");
/* Call convert method */
HtmlConverter.convertToPdf(orderHtml, target, converterProperties);
/* extract output as bytes */
byte[] bytes = target.toByteArray();
/* Send the response as downloadable PDF */
return ResponseEntity.ok()
.header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=order.pdf")
.contentType(MediaType.APPLICATION_PDF)
.body(bytes);
}
I am posting this question because many developers ask more or less the same question in different forms. I will answer this question myself (I am the Founder/CTO of iText Group), so that it can be a "Wiki-answer." If the Stack Overflow "documentation" feature still existed, this would have been a good candidate for a documentation topic.
The source file:
I am trying to convert the following HTML file to PDF:
<html>
<head>
<title>Colossal (movie)</title>
<style>
.poster { width: 120px;float: right; }
.director { font-style: italic; }
.description { font-family: serif; }
.imdb { font-size: 0.8em; }
a { color: red; }
</style>
</head>
<body>
<img src="img/colossal.jpg" class="poster" />
<h1>Colossal (2016)</h1>
<div class="director">Directed by Nacho Vigalondo</div>
<div class="description">Gloria is an out-of-work party girl
forced to leave her life in New York City, and move back home.
When reports surface that a giant creature is destroying Seoul,
she gradually comes to the realization that she is somehow connected
to this phenomenon.
</div>
<div class="imdb">Read more about this movie on
IMDB
</div>
</body>
</html>
In a browser, this HTML looks like this:
The problems I encountered:
HTMLWorker doesn't take CSS into account at all
When I used HTMLWorker, I need to create an ImageProvider to avoid an error that informs me that the image can't be found. I also need to create a StyleSheet instance to change some of the styles:
public static class MyImageFactory implements ImageProvider {
public Image getImage(String src, Map<String, String> h,
ChainedProperties cprops, DocListener doc) {
try {
return Image.getInstance(
String.format("resources/html/img/%s",
src.substring(src.lastIndexOf("/") + 1)));
} catch (DocumentException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
}
public static void main(String[] args) throws IOException, DocumentException {
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream("results/htmlworker.pdf"));
document.open();
StyleSheet styles = new StyleSheet();
styles.loadStyle("imdb", "size", "-3");
HTMLWorker htmlWorker = new HTMLWorker(document, null, styles);
HashMap<String,Object> providers = new HashMap<String, Object>();
providers.put(HTMLWorker.IMG_PROVIDER, new MyImageFactory());
htmlWorker.setProviders(providers);
htmlWorker.parse(new FileReader("resources/html/sample.html"));
document.close();
}
The result looks like this:
For some reason, HTMLWorker also shows the content of the <title> tag. I don't know how to avoid this. The CSS in the header isn't parsed at all, I have to define all the styles in my code, using the StyleSheet object.
When I look at my code, I see that plenty of objects and methods I'm using are deprecated:
So I decided to upgrade to using XML Worker.
Images aren't found when using XML Worker
I tried the following code:
public static final String DEST = "results/xmlworker1.pdf";
public static final String HTML = "resources/html/sample.html";
public void createPdf(String file) throws IOException, DocumentException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
document.open();
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
new FileInputStream(HTML));
document.close();
}
This resulted in the following PDF:
Instead of Times-Roman, the default font Helvetica is used; this is typical for iText (I should have defined a font explicitly in my HTML). Otherwise, the CSS seems to be respected, but the image is missing, and I didn't get an error message.
With HTMLWorker, an exception was thrown, and I was able to fix the problem by introducing an ImageProvider. Let's see if this works for XML Worker.
Not all CSS styles are supported in XML Worker
I adapted my code like this:
public static final String DEST = "results/xmlworker2.pdf";
public static final String HTML = "resources/html/sample.html";
public static final String IMG_PATH = "resources/html/";
public void createPdf(String file) throws IOException, DocumentException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
document.open();
CSSResolver cssResolver =
XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
htmlContext.setImageProvider(new AbstractImageProvider() {
public String getImageRootPath() {
return IMG_PATH;
}
});
PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
XMLWorker worker = new XMLWorker(css, true);
XMLParser p = new XMLParser(worker);
p.parse(new FileInputStream(HTML));
document.close();
}
My code is much longer, but now the image is rendered:
The image is larger than when I rendered it using HTMLWorker which tells me that the CSS attribute width for the poster class is taken into account, but the float attribute is ignored. How do I fix this?
The remaining question:
So the question boils down to this: I have a specific HTML file that I try to convert to PDF. I have gone through a lot of work, fixing one problem after the other, but there is one specific problem that I can't solve: how do I make iText respect CSS that defines the position of an element, such as float: right?
Additional question:
When my HTML contains form elements (such as <input>), those form elements are ignored.
Why your code doesn't work
As explained in the introduction of the HTML to PDF tutorial, HTMLWorker has been deprecated many years ago. It wasn't intended to convert complete HTML pages. It doesn't know that an HTML page has a <head> and a <body> section; it just parses all the content. It was meant to parse small HTML snippets, and you could define styles using the StyleSheet class; real CSS wasn't supported.
Then came XML Worker. XML Worker was meant as a generic framework to parse XML. As a proof of concept, we decided to write some XHTML to PDF functionality, but we didn't support all of the HTML tags. For instance: forms weren't supported at all, and it was very hard to support CSS that is used to position content. Forms in HTML are very different from forms in PDF. There was also a mismatch between the iText architecture and the architecture of HTML + CSS. Gradually, we extended XML Worker, mostly based on requests from customers, but XML Worker became a monster with many tentacles.
Eventually, we decided to rewrite iText from scratch, with the requirements for HTML + CSS conversion in mind. This resulted in iText 7. On top of iText 7, we created several add-ons, the most important one in this context being pdfHTML.
How to solve the problem
Using the latest version of iText (iText 7.1.0 + pdfHTML 2.0.0) the code to convert the HTML from the question to PDF is reduced to this snippet:
public static final String SRC = "src/main/resources/html/sample.html";
public static final String DEST = "target/results/sample.pdf";
public void createPdf(String src, String dest) throws IOException {
HtmlConverter.convertToPdf(new File(src), new File(dest));
}
The result looks like this:
As you can see, this is pretty much the result you'd expect. Since iText 7.1.0 / pdfHTML 2.0.0, the default font is Times-Roman. The CSS is being respected: the image is now floating on the right.
Some additional thoughts.
Developers often feel opposed to upgrade to a newer iText version when I give the advice to upgrade to iText 7 / pdfHTML 2. Allow me to answer to the top 3 of arguments I hear:
I need to use the free iText, and iText 7 isn't free / the pdfHTML add-on is closed source.
iText 7 is released using the AGPL, just like iText 5 and XML Worker. The AGPL allows free use in the sense of free of charge in the context of open source projects. If you are distributing a closed source / proprietary product (e.g. you use iText in a SaaS context), you can't use iText for free; in that case, you have to purchase a commercial license. This was already true for iText 5; this is still true for iText 7. As for versions prior to iText 5: you shouldn't use these at all. Regarding pdfHTML: the first versions were indeed only available as closed source software. We have had heavy discussion within iText Group: on the one hand, there were the people who wanted to avoid the massive abuse by companies who don't listen to their developers when those developers tell the powers that be that open source isn't the same as free. Developers were telling us that their boss forced them to do the wrong thing, and that they couldn't convince their boss to purchase a commercial license. On the other hand, there were the people who argued that we shouldn't punish developers for the wrong behavior of their bosses. Eventually, the people in favor of open sourcing pdfHTML, that is: the developers at iText, won the argument. Please prove that they weren't wrong, and use iText correctly: respect the AGPL if you're using iText for free; make sure that your boss purchases a commercial license if you're using iText in a closed source context.
I need to maintain a legacy system, and I have to use an old iText version.
Seriously? Maintenance also involves applying upgrades and migrating to new versions of the software you're using. As you can see, the code needed when using iText 7 and pdfHTML is very simple, and less error-prone than the code needed before. A migration project shouldn't take too long.
I've only just started and I didn't know about iText 7; I only found out after I finished my project.
That's why I'm posting this question and answer. Think of yourself as an eXtreme Programmer. Throw away all of your code, and start anew. You'll notice that it's not as much work as you imagined, and you'll sleep better knowing that you've made your project future-proof because iText 5 is being phased out. We still offer support to paying customers, but eventually, we'll stop supporting iText 5 altogether.
Use iText 7 and this code:
public void generatePDF(String htmlFile) {
try {
//HTML String
String htmlString = htmlFile;
//Setting destination
FileOutputStream fileOutputStream = new FileOutputStream(new File(dirPath + "/USER-16-PF-Report.pdf"));
PdfWriter pdfWriter = new PdfWriter(fileOutputStream);
ConverterProperties converterProperties = new ConverterProperties();
PdfDocument pdfDocument = new PdfDocument(pdfWriter);
//For setting the PAGE SIZE
pdfDocument.setDefaultPageSize(new PageSize(PageSize.A3));
Document document = HtmlConverter.convertToDocument(htmlFile, pdfDocument, converterProperties);
document.close();
}
catch (Exception e) {
e.printStackTrace();
}
}
Convert a static HTML page take also any CSS Style:
HtmlConverter.convertToPdf(new File("./pdf-input.html"),new File("demo-html.pdf"));
For spring Boot user: Convert a dynamic HTML page using SpringBoot and Thymeleaf:
#RequestMapping(path = "/pdf")
public ResponseEntity<?> getPDF(HttpServletRequest request, HttpServletResponse response) throws IOException {
/* Do Business Logic*/
Order order = OrderHelper.getOrder();
/* Create HTML using Thymeleaf template Engine */
WebContext context = new WebContext(request, response, servletContext);
context.setVariable("orderEntry", order);
String orderHtml = templateEngine.process("order", context);
/* Setup Source and target I/O streams */
ByteArrayOutputStream target = new ByteArrayOutputStream();
ConverterProperties converterProperties = new ConverterProperties();
converterProperties.setBaseUri("http://localhost:8080");
/* Call convert method */
HtmlConverter.convertToPdf(orderHtml, target, converterProperties);
/* extract output as bytes */
byte[] bytes = target.toByteArray();
/* Send the response as downloadable PDF */
return ResponseEntity.ok()
.header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=order.pdf")
.contentType(MediaType.APPLICATION_PDF)
.body(bytes);
}
I am developing an Application in Xamarin for Android. I have already generated HTML file using StringBuilder. Now I have a HTML file in my External storage and the same template is required for PDF. So when I try to convert HTML to PDF using iTextSharp using XML Worker & PDFSharp libraries, I am getting build errors due to missing System.Drawing.dll. Then I found from Xamarin forums & Stackoverflow links that it is not supported for Xamarin.Android.
Can anyone please tell me other alternative about how to create template for PDF or any other working nuget package for Xamarin.Android which will convert html file to pdf.
NOTE: I am able to generate PDF but not able to convert HTML to PDF.
It would be of great help!. Thanks a ton!.
Use Nuget package Xam.iTextSharpLGPL
Below is the sample code
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using Android.Graphics;
string path = Android.OS.Environment.ExternalStorageDirectory.Path;
string pdfPath = System.IO.Path.Combine(path, "samplee.pdf");
System.IO.FileStream fs = new FileStream(pdfPath, FileMode.Create);
Document document = new Document(PageSize.A4);
PdfWriter writer = PdfWriter.GetInstance(document, fs);
HTMLWorker worker = new HTMLWorker(document);
document.Open();
StringBuilder html = new StringBuilder();
html.Append("<? xml version='1.0' encoding='utf-8' ?><html><head><title></title></head>");
html.Append("<CENTER>Simple Sample html</H1>");
html.Append("<H4>By User1</H4>");
html.Append("<H2>Demonstrating a few HTML features</H2>");
html.Append("</CENTER>");
html.Append("<p>HTML doesn't normally use line breaks for ordinary text. A white space of any size is treated as a single space. This is because the author of the page has no way of knowing the size of the reader's screen, or what size type they will have their browser set for.");
html.Append("</p></body</html>");
TextReader reader = new StringReader(html.ToString());
worker.StartDocument();
worker.Parse(reader);
worker.EndDocument();
worker.Close();
document.Close();
writer.Close();
fs.Close();