iTextSharp (XMLWorker) parsing is slow - C#

I have been using iTextSharp to convert an MVC view to PDF. The view uses inline styling. Everything works fine with the code below, but the parsing is slow:
using (var ms = new MemoryStream())
{
using (var doc = new Document(PageSize.A4, 0, 1, 0,0))
{
using (var writer = PdfWriter.GetInstance(doc, ms))
{
doc.Open();
XMLWorkerHelper.GetInstance().ParseXHtml(writer,doc, htmlcontent);
//Above line is too slow
doc.Close();
}
}
}
As suggested by the experts here, I made the following modifications:
Registered the fonts explicitly
Moved the styling to a separate CSS file
Now I am using the code below, but the generated PDF is blank. It retains the styles but not the fonts, and even this approach takes the same time to parse:
using (var ms = new MemoryStream())
{
using (var doc = new Document(PageSize.A4, 0, 1, 0,0))
{
using (var writer = PdfWriter.GetInstance(doc, ms))
{
doc.Open();
// css
var cssResolver = new StyleAttrCSSResolver();
var cssFile = XMLWorkerHelper.GetCSS((new FileStream(Server.MapPath("~/Content/scptpdf.css"), FileMode.Open, FileAccess.Read)));
cssResolver.AddCss(cssFile);
// html
var fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.Register(Server.MapPath("~/Content/fonts/arial.ttf"));
fontProvider.Register(Server.MapPath("~/Content/fonts/arialbd."));
fontProvider.AddFontSubstitute("calibri","ARIAL");
var cssAppliers = new CssAppliersImpl(fontProvider);
var htmlContext = new HtmlPipelineContext(cssAppliers);
htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
var pdf = new PdfWriterPipeline(doc, writer);
var html = new HtmlPipeline(htmlContext, pdf);
var css = new CssResolverPipeline(cssResolver, html);
var worker = new XMLWorker(css,true);
var p = new XMLParser(worker);
byte[] byteArray = Encoding.UTF8.GetBytes(pdftext);
var htmlstream = new MemoryStream(byteArray);
p.Parse(htmlstream);
//XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, htmlcontent);
doc.Close();
}
}
}
I need to overcome the latency. Can someone help with this? Thanks in advance.

I removed the font types. Now iTextSharp uses its own fonts, and it is fast too.

Related

Adding PDF from stream while creating PDF with iTextSharp

I have to create a new PDF document, but I also have to attach existing PDF files to the new one.
It was easy enough to attach them at the end. But I have to attach these in the middle of the document so it's easier to know which is related to which.
So, essentially it would look like this:
New coverpage (from html)
New invoicerow (from html)
old invoiceRow (from stream)
New invoicerow (from html)
old invoiceRow (from stream)
But the issue I am running into is that either the document doesn't accept the stream from the already existing PDF, which means I only get the newly generated rows, or one of the existing PDFs is all I see.
The code I have so far, which works for generating the new rows like I want them (before adding the existing PDFs that is) looks like this:
private async Task<byte[]> CreateHtmlString(List<Invoice> invoices)
{
byte[] bytes;
using (MemoryStream memoryStream = new MemoryStream())
{
using (Document document = new Document(PageSize.A4, 10F, 10F, 10F, 0F))
{
using (PdfWriter pdfWriter = PdfWriter.GetInstance(document, memoryStream))
{
pdfWriter.CloseStream = false;
if (!document.IsOpen())
{
document.Open();
}
StringBuilder sbHeader = new StringBuilder();
sbHeader.Append("<!DOCTYPE html>");
sbHeader.Append("<html>");
sbHeader.Append("<body>");
sbHeader.Append("<table style='table-layout: auto; width: 100%;'>");
sbHeader.Append("<tr>");
sbHeader.Append("</tr>");
foreach (var invoice in invoices)
{
sbHeader.Append("<tr>");
sbHeader.Append("</tr>");
}
sbHeader.Append("</table>");
sbHeader.Append("</body>");
sbHeader.Append("</html>");
using (StringReader srHtml = new StringReader(sbHeader.ToString()))
{
HTMLWorker htmlparser = new HTMLWorker(document);
using (MemoryStream ms = new MemoryStream())
{
htmlparser.Parse(srHtml);
}
}
foreach (var invoice in invoices)
{
document.NewPage();
StringBuilder sbRow = new StringBuilder();
sbRow.Append("<div style='page-break-before:always'> </div>");
sbRow.Append("<table style='table-layout: auto; width: 100%;'>");
sbRow.Append("<tr>");
sbRow.Append("</tr>");
foreach (var acknowledgeJournal in invoice.AcknowledgementJournals)
{
sbRow.Append("<tr>");
sbRow.Append("</tr>");
}
sbRow.Append("</table>");
sbRow.Append("<table style='table-layout: auto; width: 100%;'>");
sbRow.Append("<tr>");
sbRow.Append("</tr>");
foreach (var invoiceItem in invoice.InvoiceItems)
{
sbRow.Append("<tr>");
sbRow.Append("</tr>");
}
sbRow.Append("</table>");
sbRow.Append("</body>");
sbRow.Append("</html>");
using (StringReader srHtml = new StringReader(sbRow.ToString()))
{
HTMLWorker htmlparser = new HTMLWorker(document);
using (MemoryStream ms = new MemoryStream())
{
htmlparser.Parse(srHtml);
foreach (var attachment in invoice.Attachments)
{
var retrievedAttachment = await getPdf();
retrievedAttachment.CopyTo(memoryStream);
}
}
}
}
bytes = memoryStream.ToArray();
string outputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Combined.pdf");
using (FileStream fs = new FileStream(outputFile, FileMode.Create, FileAccess.Write, FileShare.None))
{
fs.Write(bytes, 0, bytes.Length);
}
return bytes;
}
}
}
}
Is it at all possible to do what I want to do this way? Or will I have to change it, and make one stream per page, and then merge them after creation?

Parser results in "The document has no pages"

I'm trying to generate a PDF from HTML/CSS using iTextSharp v5. The error I get is "Document has no pages". Is my parser set up wrong? How do I get the parsed HTML added to my document?
public void ConvertHtmlToPdf(string xHtml, string css)
{
using (var stream = new FileStream("App_Data/pdfs/testt.pdf", FileMode.Create))
{
using (var document = new Document(PageSize.A4, 10f, 10f, 10f, 0f))
{
var writer = PdfWriter.GetInstance(document, stream);
document.Open();
// instantiate custom tag processor and add to `HtmlPipelineContext`.
var tagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
var htmlPipelineContext = new HtmlPipelineContext(null);
htmlPipelineContext.SetTagFactory(tagProcessorFactory);
var pdfWriterPipeline = new PdfWriterPipeline(document, writer);
var htmlPipeline = new HtmlPipeline(htmlPipelineContext, pdfWriterPipeline);
// get an ICssResolver and add the custom CSS
var cssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
cssResolver.AddCss(css, "utf-8", true);
var cssResolverPipeline = new CssResolverPipeline(
cssResolver, htmlPipeline
);
var worker = new XMLWorker(cssResolverPipeline, true);
var parser = new XMLParser(worker);
using (var stringReader = new StringReader(xHtml))
{
parser.Parse(stringReader);
}
document.Close();
writer.Close();
}
}
}
The document is empty because the worker classes did not write any content to it.
To avoid this exception, add an empty chunk to the document immediately after opening it:
document.Add(new Chunk(""));
To Convert HTML to PDF, you can check this example
How to convert HTML to PDF using iText

Generate one PDF document with multiple pages converting from HTML using iText 7

I'm working with iText 7. I've been able to take one HTML page and generate a PDF for it, but I need to generate a single PDF document from multiple HTML pages, with each HTML page on its own PDF page. For example: I have Page1.html, Page2.html and Page3.html. I need a PDF document with 3 pages: the first page with the content of Page1.html, the second page with the content of Page2.html, and so on.
This is the code I have and it's working for one html page:
ConverterProperties properties = new ConverterProperties();
PdfWriter writer = new PdfWriter(pdfRoot, new WriterProperties().SetFullCompressionMode(true));
PdfDocument pdfDocument = new PdfDocument(writer);
pdfDocument.AddEventHandler(PdfDocumentEvent.END_PAGE, new HeaderPdfEventHandler());
HtmlConverter.ConvertToPdf(htmlContent, pdfDocument, properties);
Is it possible to loop against the multiple html pages, add a new page to the PdfDocument for every html page and then have only one pdf generated with one page per html page?
UPDATE
I've been following this example and trying to translate it from Java to C#. I'm trying to use PdfMerger and loop over the HTML pages, but I'm receiving the exception "Cannot access a closed stream" on this line:
temp = new PdfDocument(
new PdfReader(new RandomAccessSourceFactory().CreateSource(baos), rp));
It looks like it is related to the ByteArrayOutputStream baos instance. Any suggestions? This is my current code:
foreach (var html in htmlList)
{
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PdfDocument temp = new PdfDocument(new PdfWriter(baos));
HtmlConverter.ConvertToPdf(html, temp, properties);
ReaderProperties rp = new ReaderProperties();
temp = new PdfDocument(
new PdfReader(new RandomAccessSourceFactory().CreateSource(baos), rp));
merger.Merge(temp, 1, temp.GetNumberOfPages());
temp.Close();
}
pdfDocument.Close();
You are passing RandomAccessSourceFactory a closed stream, the one you wrote the PDF document into. RandomAccessSourceFactory expects an input stream that is ready to be read instead.
First of all you should use MemoryStream which is native to .NET world. ByteArrayOutputStream is the class that was ported from Java for internal purposes (although it extends MemoryStream as well). Secondly, you don't have to use RandomAccessSourceFactory - there is a simpler way.
You can create a new MemoryStream instance from the bytes of the MemoryStream that you used to create a temporary PDF with the following line:
baos = new MemoryStream(baos.ToArray());
As an additional remark, it's better to close PdfMerger instance directly instead of closing the document - closing PdfMerger closes the underlying document as well.
All in all, we get the following code that works:
foreach (var html in htmlList)
{
MemoryStream baos = new MemoryStream();
PdfDocument temp = new PdfDocument(new PdfWriter(baos));
HtmlConverter.ConvertToPdf(html, temp, properties);
ReaderProperties rp = new ReaderProperties();
baos = new MemoryStream(baos.ToArray());
temp = new PdfDocument(new PdfReader(baos, rp));
pdfMerger.Merge(temp, 1, temp.GetNumberOfPages());
temp.Close();
}
pdfMerger.Close();
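The fix above relies on a detail of .NET's MemoryStream: once the stream has been closed it can no longer be read, but ToArray() still returns the bytes that were written. The sketch below illustrates this with plain .NET types only, no iText involved. ClosedStreamDemo, WriteAndClose, ReadFails, and Reopen are hypothetical names used for illustration, and WriteAndClose merely stands in for a writer that closes its target stream when it finishes.

```csharp
using System;
using System.IO;

public static class ClosedStreamDemo
{
    // Hypothetical stand-in for a PDF writer: writes the payload,
    // then closes the stream it was given.
    public static void WriteAndClose(Stream target, byte[] payload)
    {
        target.Write(payload, 0, payload.Length);
        target.Close();
    }

    // Returns true if reading from the stream throws because it is closed.
    public static bool ReadFails(MemoryStream stream)
    {
        try
        {
            stream.ReadByte();
            return false;
        }
        catch (ObjectDisposedException)
        {
            return true;
        }
    }

    // ToArray() still works on a closed MemoryStream, so its bytes can be
    // wrapped in a fresh stream that is open and positioned at the start.
    public static MemoryStream Reopen(MemoryStream closed)
    {
        return new MemoryStream(closed.ToArray());
    }
}
```

After WriteAndClose(baos, bytes), ReadFails(baos) reports true, while Reopen(baos) yields a readable stream with the same content. This is the same pattern as baos = new MemoryStream(baos.ToArray()) in the answer.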
This may not be as succinct, but here is a similar answer that uses "using" blocks:
private byte[] CreatePDF(string html)
{
byte[] binData;
using (var workStream = new MemoryStream())
{
using (var pdfWriter = new PdfWriter(workStream))
{
//Create one pdf document
using (var pdfDoc = new PdfDocument(pdfWriter))
{
pdfDoc.SetDefaultPageSize(iText.Kernel.Geom.PageSize.A4.Rotate());
//Create one pdf merger
var pdfMerger = new PdfMerger(pdfDoc);
//Create two identical pdfs
for (int i = 0; i < 2; i++)
{
using (var newStream = new MemoryStream(CreateDocument(html)))
{
ReaderProperties rp = new ReaderProperties();
using (var newPdf = new PdfDocument(new PdfReader(newStream, rp)))
{
pdfMerger.Merge(newPdf, 1, newPdf.GetNumberOfPages());
}
}
}
}
binData = workStream.ToArray();
}
}
return binData;
}
Create the PDF:
private byte[] CreateDocument(string html)
{
byte[] binData;
using (var workStream = new MemoryStream())
{
using (var pdfWriter = new PdfWriter(workStream))
{
using (var pdfDoc = new PdfDocument(pdfWriter))
{
pdfDoc.SetDefaultPageSize(iText.Kernel.Geom.PageSize.A4.Rotate());
ConverterProperties props = new ConverterProperties();
using (var document = HtmlConverter.ConvertToDocument(html, pdfDoc, props))
{
}
}
binData = workStream.ToArray();
}
}
return binData;
}

itextsharp html to pdf

I want to convert some HTML to a PDF. All my HTML is in an HTML string, but I don't know how to pass it in correctly with iTextSharp.
public void PDF()
{
// Create a doc object
var doc = new Document(PageSize.A4, 50, 50, 25, 25);
// Create a new PdfWriter object, writing the output to the file t.pdf
var output = new FileStream(Server.MapPath("t.pdf"), FileMode.Create);
var writer = PdfWriter.GetInstance(doc, output);
// Open the doc for writing
doc.Open();
//Add Wallpaper image to the pdf
var Wallpaper = iTextSharp.text.Image.GetInstance(Server.MapPath("hfc.png"));
Wallpaper.SetAbsolutePosition(0, 0);
Wallpaper.ScaleAbsolute(600, 840);
doc.Add(Wallpaper);
iTextSharp.text.html.simpleparser.HTMLWorker hw = new iTextSharp.text.html.simpleparser.HTMLWorker(doc);
StyleSheet css = new StyleSheet();
css.LoadTagStyle("body", "face", "Garamond");
css.LoadTagStyle("body", "encoding", "Identity-H");
css.LoadTagStyle("body", "size", "12pt");
hw.Parse(new StringReader(HTML));
doc.Close();
Response.Redirect("t.pdf");
}
If anyone knows how to make this work, it would be good.
Thanks
Dom
Please download The Best iText Questions on StackOverflow. It's a free ebook, you'll benefit from it.
Once you have downloaded it, go to the section entitled "Parsing XML and XHTML".
Allow me to quote from the answer to this question: RowSpan does not work in iTextSharp?
You are using HTMLWorker instead of XML Worker, and you are right:
HTMLWorker has no support for CSS. Saying CSS doesn't work in
iTextSharp is wrong. It doesn't work when you use HTMLWorker, but
that's documented: the CSS you need works in XML Worker.
Please throw away your code, and start anew using XML Worker.
There are many examples (simple ones as well as complex ones) in the book. Let me give you only one:
using (var fsOut = new FileStream(outputFile, FileMode.Create, FileAccess.Write))
using (var stringReader = new StringReader(result))
{
var document = new Document();
var pdfWriter = PdfWriter.GetInstance(document, fsOut);
pdfWriter.InitialLeading = 12.5f;
document.Open();
var xmlWorkerHelper = XMLWorkerHelper.GetInstance();
var cssResolver = new StyleAttrCSSResolver();
var xmlWorkerFontProvider = new XMLWorkerFontProvider();
foreach (string font in fonts)
{
xmlWorkerFontProvider.Register(font);
}
var cssAppliers = new CssAppliersImpl(xmlWorkerFontProvider);
var htmlContext = new HtmlPipelineContext(cssAppliers);
htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
PdfWriterPipeline pdfWriterPipeline = new PdfWriterPipeline(document, pdfWriter);
HtmlPipeline htmlPipeline = new HtmlPipeline(htmlContext, pdfWriterPipeline);
CssResolverPipeline cssResolverPipeline = new CssResolverPipeline(cssResolver, htmlPipeline);
XMLWorker xmlWorker = new XMLWorker(cssResolverPipeline, true);
XMLParser xmlParser = new XMLParser(xmlWorker);
xmlParser.Parse(stringReader);
document.Close();
}
(Source: iTextSharp XmlWorker: right-to-left)
If you want an easier example, take a look at the answers of these questions:
How to parse multiple HTML files into a single PDF?
How to add a rich Textbox (HTML) to a table cell?
...
The code that parses an HTML string and a CSS string to a list of iText(Sharp) elements is as simple as this:
ElementList list = XMLWorkerHelper.ParseToElementList(html, css);
You can find more examples on the official iText web site.

iTextSharp "The document has no pages" error when I have an anchor tag

I am converting some HTML to PDF. It works fine, but when I have an anchor tag in my HTML I get the error "The document has no pages".
My code is:
byte[] data;
using (var sr = new StringReader(sw.ToString()))
{
var st = new StyleSheet();
GetStyleSheetForUnicodeCharacters(st);
using (var ms = new MemoryStream())
{
using (var pdfDoc = new Document())
{
using (var w = PdfWriter.GetInstance(pdfDoc, ms))
{
pdfDoc.Open();
var parsedHtmlElements = HTMLWorker.ParseToList(sr, st);
foreach (var htmlElement in parsedHtmlElements)
{
pdfDoc.Add(htmlElement as IElement);
}
pdfDoc.Close();
data = ms.ToArray();
}
}
}
}
The problem may be invalid HTML. One way to check is to run your HTML source through a validator such as the W3C Markup Validation Service.
Have you already tried adding a page with:
pdfDoc.NewPage();
I think your code should look like this:
byte[] data;
using (var sr = new StringReader(sw.ToString()))
{
var st = new StyleSheet();
GetStyleSheetForUnicodeCharacters(st);
using (var ms = new MemoryStream())
{
using (var pdfDoc = new Document())
{
using (var w = PdfWriter.GetInstance(pdfDoc, ms))
{
pdfDoc.Open();
pdfDoc.NewPage(); // add Page here
var parsedHtmlElements = HTMLWorker.ParseToList(sr, st);
foreach (var htmlElement in parsedHtmlElements)
{
pdfDoc.Add(htmlElement as IElement);
}
pdfDoc.Close();
data = ms.ToArray();
}
}
}
}
You can also add a blank page by using:
pdfDoc.NewPage();
w.PageEmpty = false;
Best regards, chris
Check whether any HTML tags are mismatched or malformed. For example, a broken closing tag such as /td> raises the above error.
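A quick way to catch this kind of mistake before handing the markup to the parser is to check that it is well-formed XML. The sketch below uses XDocument.Parse from .NET for that; MarkupCheck and FindMarkupError are hypothetical names. It assumes the input is meant to be XHTML with a single root element. The check is stricter than HTMLWorker, so HTML-only constructs such as an unclosed <br> or an undeclared entity like &nbsp; are also reported.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class MarkupCheck
{
    // Returns null when the markup is well-formed XML; otherwise returns
    // the parser's message, which includes the line and position of the
    // first problem it found.
    public static string FindMarkupError(string xhtml)
    {
        try
        {
            XDocument.Parse(xhtml);
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }
}
```

For example, FindMarkupError("<tr><td>a/td></tr>") returns a non-null message, because the closing </tr> no longer matches the still-open <td>.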
