I'm attempting to split a PDF file page by page, and get each page file's byte array. However, I'm having trouble converting each page to byte array in iText version 7.0.4 for C#.
Methods referenced in other solutions rely on PdfWriter.GetInstance or PdfCopy, which seem to no longer exist in iText version 7.0.4.
I've gone through iText's sample code and API documentation, but I have not been able to extract any useful information from them.
using (Stream stream = new MemoryStream(pdfBytes))
using (PdfReader reader = new PdfReader(stream))
using (PdfDocument pdfDocument = new PdfDocument(reader))
{
    PdfSplitter splitter = new PdfSplitter(pdfDocument);
    // My Attempt #1 - None of the document's functions seem to be of help.
    foreach (PdfDocument splitPage in splitter.SplitByPageCount(1))
    {
        // ??
    }
    // My Attempt #2 - GetContentBytes != pdf file bytes.
    for (int i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
    {
        PdfPage page = pdfDocument.GetPage(i);
        byte[] bytes = page.GetContentBytes();
    }
}
Any help would be much appreciated.
Using PdfSplitter is one of the best ways to approach your task. Not much is available out of the box, but PdfSplitter is highly customizable, and a look at its implementation, or simply its API, makes clear where you can inject your own behavior.
You should override GetNextPdfWriter to provide whatever output medium you want the documents to be written to. You can also use IDocumentReadyListener to define the action that is performed once each document is ready.
Here is one implementation that achieves your goal:
class ByteArrayPdfSplitter : PdfSplitter {
    private MemoryStream currentOutputStream;
    public ByteArrayPdfSplitter(PdfDocument pdfDocument) : base(pdfDocument) {
    }
    // Called once per output document: back each one with a fresh MemoryStream.
    protected override PdfWriter GetNextPdfWriter(PageRange documentPageRange) {
        currentOutputStream = new MemoryStream();
        return new PdfWriter(currentOutputStream);
    }
    public MemoryStream CurrentMemoryStream {
        get { return currentOutputStream; }
    }
    public class DocumentReadyListender : IDocumentReadyListener {
        private ByteArrayPdfSplitter splitter;
        public DocumentReadyListender(ByteArrayPdfSplitter splitter) {
            this.splitter = splitter;
        }
        public void DocumentReady(PdfDocument pdfDocument, PageRange pageRange) {
            // Close first so the PDF is completely written, then read the bytes.
            pdfDocument.Close();
            byte[] contents = splitter.CurrentMemoryStream.ToArray();
            String pageNumber = pageRange.ToString();
            // contents now holds the PDF for this page range; store or process it here.
        }
    }
}
The calls are basically the same as yours, but with a custom document-ready listener:
PdfDocument docToSplit = new PdfDocument(new PdfReader(path));
ByteArrayPdfSplitter splitter = new ByteArrayPdfSplitter(docToSplit);
splitter.SplitByPageCount(1, new ByteArrayPdfSplitter.DocumentReadyListender(splitter));
I have created a sample project to test the itext7.pdfHTML library. It appears to support only English, but I find it hard to believe that it supports just one language. Please help me make it support multiple languages. I only convert HTML to PDF, and I use itext7.pdfhtml 3.0.4.
Controller
public IActionResult TestPDFHtml()
{
string html = @"<html><head><meta http-equiv=""content-type"" content=""text/html; charset=UTF-8""></head><body>สวัสดี</body></html>";
TestPDF test = new TestPDF();
byte[] vs = test.creatPDFByte(html);
return File(vs, "application/pdf");
}
The creatPDFByte method
public byte[] creatPDFByte(string pdfHTML)
{
byte[] buffer;
try
{
using (MemoryStream ms = new MemoryStream())
{
using (PdfWriter pw = new PdfWriter(ms))
{
pw.SetCloseStream(true);
using (PdfDocument pdfDoc = new PdfDocument(pw))
{
ConverterProperties cProps = new ConverterProperties();
cProps.SetCharset("UTF-8");
pdfDoc.SetDefaultPageSize(PageSize.A4);
pdfDoc.SetCloseWriter(true);
pdfDoc.SetCloseReader(true);
pdfDoc.SetFlushUnusedObjects(true);
HtmlConverter.ConvertToPdf(pdfHTML, pdfDoc, cProps);
}
}
buffer = ms.ToArray();
}
}
catch ...
Why do you think so? Have you tried specifying a font in your CSS? The default font may not support your language.
I need to merge N PDF files into one. I create a blank file first
byte[] pdfBytes = null;
var ms = new MemoryStream();
var doc = new iTextSharp.text.Document();
var cWriter = new PdfCopy(doc, ms);
Later I cycle through html strings array
foreach (NBElement htmlString in someElement.Children())
{
byte[] msTempDoc = getPdfDocFrom(htmlString.GetString(), cssString.GetString());
addPagesToPdf(cWriter, msTempDoc);
}
In getPdfDocFrom I create pdf file using XMLWorkerHelper and return it as byte array
private byte[] getPdfDocFrom(string htmlString, string cssString)
{
var tempMs = new MemoryStream();
byte[] tempMsBytes;
var tempDoc = new iTextSharp.text.Document();
var tempWriter = PdfWriter.GetInstance(tempDoc, tempMs);
tempDoc.Open();
using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(cssString)))
{
using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(htmlString)))
{
//Parse the HTML
iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(tempWriter, tempDoc, msHtml, msCss);
tempMsBytes = tempMs.ToArray();
}
}
tempDoc.Close();
return tempMsBytes;
}
Later on I try to add pages from this PDF file to the blank one.
private static void addPagesToPdf(PdfCopy mainDocWriter, byte[] sourceDocBytes)
{
using (var msOut = new MemoryStream())
{
PdfReader reader = new PdfReader(new MemoryStream(sourceDocBytes));
int n = reader.NumberOfPages;
PdfImportedPage page;
for (int i = 1; i <= n; i++)
{
page = mainDocWriter.GetImportedPage(reader, i);
mainDocWriter.AddPage(page);
}
}
}
It breaks when it tries to create a PdfReader from the byte array I pass to the function. "Rebuild failed: trailer not found.; Original message: PDF startxref not found."
I used another library to work with PDF before. I passed two PdfDocuments as objects and just added pages from one to the other in a loop. It didn't support CSS though, so I had to switch to iTextSharp.
I don't quite get the difference between PdfWriter and PdfCopy.
There is a logical error in your code. When you create a document from scratch, as is done in the getPdfDocFrom() method, the document isn't complete until you've triggered the Close() method. In this Close() method, a trailer is created as well as a cross-reference (xref) table. The error tells you that those are missing.
Indeed, you do call the Close() method:
tempDoc.Close();
But by the time you Close() the document, it's too late: you have already created the tempMsBytes array. You need to create that array after you close the document.
Edit: I don't know anything about C#, but if MemoryStream clears its buffer after closing it, you could use mainDocWriter.CloseStream = false; so that the MemoryStream isn't closed when you close the document.
In Java, it would be a bad idea to set the "close stream" parameter to false. When I read the answers to the question Create PDF in memory instead of physical file I see that C# probably doesn't always require this extra line.
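On the C# side, MemoryStream.ToArray() is documented to keep working after the stream has been closed, so the snapshot can simply be taken after Close(). Here is a minimal sketch of the timing problem that uses only System.IO. A StreamWriter stands in for the PDF writer (that substitution is an assumption made for illustration, it is not iTextSharp code), and SnapshotTiming is a made-up name.

```csharp
using System;
using System.IO;

static class SnapshotTiming
{
    // Returns the number of bytes captured before and after closing the writer.
    public static (int Early, int Late) Measure()
    {
        var ms = new MemoryStream();
        var writer = new StreamWriter(ms);   // stand-in for a PDF writer (assumption)
        writer.Write("body");                // still sitting in the writer's buffer
        int early = ms.ToArray().Length;     // snapshot taken too early
        writer.Close();                      // flushes, then closes ms as well
        int late = ms.ToArray().Length;      // ToArray still works on a closed MemoryStream
        return (early, late);
    }

    static void Main()
    {
        var result = Measure();
        Console.WriteLine("before close: " + result.Early + ", after close: " + result.Late);
    }
}
```

The early snapshot misses whatever the writer had not flushed yet, which is the same situation as calling tempMs.ToArray() before tempDoc.Close().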
Remark: merging files by adding PdfImportedPage instances to a PdfWriter is an example of bad taste. If you are using iTextSharp 5 or earlier, you should use PdfCopy or PdfSmartCopy to do that. If you use PdfWriter, you throw away a lot of information (e.g. link annotations).
My goal is to open an existing PDF and add or remove some pages while preserving the metadata (Author, Subject, ...) in a Windows.Forms C# application.
I use iTextSharp and found examples how to add or remove pages by using the PdfConcatenate class. To keep the metadata I use a PdfStamper afterwards. To speed things up I want to do the modifications in memory before storing the result to disk.
The problem is NOT adding or removing the pages but to keep the metadata in the same step.
So can anybody tell me, or give an example of, how to achieve this (better), or am I on completely the wrong track?
Here is my current code (see comments for problem related lines):
public void RemovePagesInFile(string documentLocation, int pageIndexFrom, int pageCount)
{
// TB: open the pdf
using (PdfReader sourcePdfReader = new PdfReader(documentLocation))
using (MemoryStream concatenatedTargetStream = new MemoryStream((int)sourcePdfReader.FileLength))
{
// TB: use a concatenator to create a new pdf containing only the desired pages
PdfConcatenate concatenator = new PdfConcatenate(concatenatedTargetStream);
// TB: create a list with the page numbers to keep
List<int> pagesToKeep = new List<int>();
for (int i = 1; i <= pageIndexFrom; i++)
{
pagesToKeep.Add(i);
}
for (int i = pageIndexFrom + pageCount + 1; i <= sourcePdfReader.NumberOfPages; i++)
{
pagesToKeep.Add(i);
}
// TB: execute the page copy
sourcePdfReader.SelectPages(pagesToKeep);
concatenator.AddPages(sourcePdfReader);
// TB: problem(s) here:
// 1. when calling concatenator.Close() the memory stream gets disposed as expected.
// concatenator.Close();
// 2. even when calling concatenator.Writer.Flush() the memory stream seems to be missing content (error when creating targetReader (see below)).
// concatenator.Writer.Flush();
// 3. when keeping the concatenator open the same error as above occurs (I assume not all bytes have been written to the memory stream)
// TB: preserve the meta data from the source document
// => ERROR here: "Rebuild trailer not found. Original Error: PDF startxref not found"
using (PdfReader targetReader = new PdfReader(concatenatedTargetStream))
using (MemoryStream targetStream = new MemoryStream((int)concatenatedTargetStream.Length))
{
using (PdfStamper stamper = new PdfStamper(targetReader, targetStream))
{
stamper.MoreInfo = sourcePdfReader.Info;
// TB: same problem as above with stamper ?
stamper.Close();
}
// TB: close the reader to be able to access the source pdf
sourcePdfReader.Close();
// TB: write the modified pdf to the disk
File.WriteAllBytes(documentLocation, targetStream.ToArray());
}
}
}
Two changes need to be made. Call
concatenator.Writer.CloseStream = false
before calling
concatenator.Close()
Do the same thing for the PdfStamper and you're set.
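The pattern can be tried out without iTextSharp. In the sketch below, the leaveOpen argument of StreamWriter plays the role of CloseStream = false; that is an analogy using only System.IO, not the iTextSharp API, and KeepStreamOpen is a made-up name. The rewind step is an addition of mine: a stream that has just been written to sits at its end, so a reader that starts from the current position would see nothing. Whether PdfReader needs it is worth verifying in your own code.

```csharp
using System;
using System.IO;
using System.Text;

static class KeepStreamOpen
{
    // Writes text into a MemoryStream, closes the writer, then reads the same stream back.
    public static string RoundTrip(string text)
    {
        using (var ms = new MemoryStream())
        {
            // leaveOpen: true keeps ms alive when the writer is closed (analogy for CloseStream = false)
            using (var writer = new StreamWriter(ms, new UTF8Encoding(false), 1024, leaveOpen: true))
            {
                writer.Write(text);
            } // the writer is flushed and closed here; ms is complete and still open

            ms.Position = 0; // rewind before handing the stream to a reader
            using (var reader = new StreamReader(ms))
            {
                return reader.ReadToEnd();
            }
        }
    }

    static void Main()
    {
        Console.WriteLine(RoundTrip("hello"));
    }
}
```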
I wonder if this is possible. I have seen many posts on adding a watermark after the PDF has been created and saved to disk, but how do I add an image watermark while the document is being created? I know how to add an image to a document, but how do I position it so that it ends up at the end of the page?
For C#, use this code...
//new Document
Document DOC = new Document();
// open Document
DOC.Open();
//create New FileStream with image "WM.JPG"
FileStream fs1 = new FileStream("WM.JPG", FileMode.Open);
iTextSharp.text.Image JPG = iTextSharp.text.Image.GetInstance(System.Drawing.Image.FromStream(fs1), ImageFormat.Jpeg);
//Scale image
JPG.ScalePercent(35f);
//Set position
JPG.SetAbsolutePosition(130f,240f);
//Close Stream
fs1.Close();
DOC.Add(JPG);
This is essentially identical to adding a header or footer.
You need to create a class that implements PdfPageEvent and, in its OnEndPage method, grab the page's PdfContentByte and draw your image there. Use an absolute position.
Note: You probably want to derive from PdfPageEventHelper, it has empty implementations of all the page events, so you just need to write the method you actually care about.
Note: Unless your image is mostly transparent, drawing it on top of your page will cover up Many Things. IIRC ("If I Recall Correctly"), PNG and GIF files added by iText will automatically be properly masked, allowing things under them to show through.
If you want to add an opaque image underneath everything, you should override OnStartPage() instead.
This is Java, but converting it is mostly a matter of capitalizing method names and swapping get/set calls for property access.
Image watermarkImage = Image.getInstance(imgPath);
watermarkImage.setAbsolutePosition(x, y);
writer.setPageEvent(new MyPageEvent(watermarkImage));
public class MyPageEvent extends PdfPageEventHelper {
    private Image waterMark;
    public MyPageEvent(Image img) {
        waterMark = img;
    }
    @Override
    public void onEndPage/*onStartPage*/(PdfWriter writer, Document doc) {
        try {
            // draw directly on the page's content layer at the image's absolute position
            PdfContentByte content = writer.getDirectContent();
            content.addImage(waterMark);
        } catch (DocumentException e) {
            throw new ExceptionConverter(e);
        }
    }
}
This is the accepted answer's port to C#, and what worked for me. I'm using an A4 page size:
Define this BackgroundImagePdfPageEvent class:
public class BackgroundImagePdfPageEvent : PdfPageEventHelper
{
private readonly Image watermark;
public BackgroundImagePdfPageEvent(string imagePath)
{
using (var fs = new FileStream(imagePath, FileMode.Open))
{
watermark = Image.GetInstance(System.Drawing.Image.FromStream(fs), ImageFormat.Jpeg);
watermark.SetAbsolutePosition(0, 0);
watermark.ScaleAbsolute(PageSize.A4.Width, PageSize.A4.Height);
watermark.Alignment = Image.UNDERLYING;
}
}
public override void OnStartPage(PdfWriter writer, Document document)
{
document.Add(watermark);
}
}
Then, when creating your document:
var doc = new Document(PageSize.A4);
doc.SetMargins(60f, 60f, 120f, 60f);
var outputStream = new MemoryStream();
var writer = PdfWriter.GetInstance(doc, outputStream);
var imagePath = "PATH_TO_YOUR_IMAGE";
writer.PageEvent = new BackgroundImagePdfPageEvent(imagePath);
I need to generate an XML file and I need to put as much data into it as possible, BUT there is a file size limit. So I need to keep inserting data until something says no more. How do I figure out the XML file size without repeatedly writing it to a file?
I agree with John Saunders. Here's some code that basically does what he describes, but packaged as an XmlSerializer wrapper instead of a FileStream subclass, with a MemoryStream as intermediate storage. It may be more effective to extend Stream, though.
public class PartitionedXmlSerializer<TObj>
{
private readonly int _fileSizeLimit;
public PartitionedXmlSerializer(int fileSizeLimit)
{
_fileSizeLimit = fileSizeLimit;
}
public void Serialize(string filenameBase, TObj obj)
{
using (var memoryStream = new MemoryStream())
{
// serialize the object in the memory stream
using (var xmlWriter = XmlWriter.Create(memoryStream))
new XmlSerializer(typeof(TObj))
.Serialize(xmlWriter, obj);
memoryStream.Seek(0, SeekOrigin.Begin);
var extensionFormat = GetExtensionFormat(memoryStream.Length);
var buffer = new char[_fileSizeLimit];
var i = 0;
// split the stream into files
using (var streamReader = new StreamReader(memoryStream))
{
int readLength;
while ((readLength = streamReader.Read(buffer, 0, _fileSizeLimit)) > 0)
{
var filename
= Path.ChangeExtension(filenameBase,
string.Format(extensionFormat, i++));
using (var fileStream = new StreamWriter(filename))
fileStream.Write(buffer, 0, readLength);
}
}
}
}
/// <summary>
/// Gets a file extension format string whose zero padding is wide enough
/// for the number of parts implied by the file length and the max file length.
/// </summary>
/// <param name="fileLength">length of the file</param>
private string GetExtensionFormat(long fileLength)
{
// round up so a partial last file counts, and assume at least one file
var numFiles = Math.Max(1L, (fileLength + _fileSizeLimit - 1) / _fileSizeLimit);
// digits needed for the largest zero-based part index (Log10 of 0 would be -Infinity)
var extensionLength = (numFiles - 1).ToString().Length;
var zeros = new string('0', extensionLength);
return string.Format("xml.part{{0:{0}}}", zeros);
}
}
To use it, you'd initialize it with the max file length and then serialize using the base file path and then the object.
public class MyType
{
public int MyInt;
public string MyString;
}
public void Test()
{
var myObj = new MyType { MyInt = 42,
MyString = "hello there this is my string" };
new PartitionedXmlSerializer<MyType>(2)
.Serialize("myFilename", myObj);
}
This particular example will generate an xml file partitioned into
myFilename.xml.part001
myFilename.xml.part002
myFilename.xml.part003
...
myFilename.xml.part110
In general, you cannot break XML documents at arbitrary locations, even if you close all open tags.
However, if what you need is to split an XML document over multiple files, each of no more than a certain size, then you should create your own subtype of the Stream class. This "PartitionedFileStream" class could write to a particular file, up to the size limit, then create a new file, and write to that file, up to the size limit, etc.
This would leave you with multiple files which, when concatenated, make up a valid XML document.
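A rough sketch of such a stream follows. The class name comes from the description above, but everything else (the constructor parameters, the ".partNNN" naming, the FileCount property, the demo file name) is made up for illustration. It is write-only and splits purely on byte counts, so the individual parts are not valid XML on their own.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical write-only stream that starts a new file whenever the
// current one has reached maxBytesPerFile.
public sealed class PartitionedFileStream : Stream
{
    private readonly string _basePath;
    private readonly long _maxBytesPerFile;
    private FileStream _current;
    private long _writtenToCurrent;
    private long _totalWritten;

    public PartitionedFileStream(string basePath, long maxBytesPerFile)
    {
        if (maxBytesPerFile <= 0) throw new ArgumentOutOfRangeException(nameof(maxBytesPerFile));
        _basePath = basePath;
        _maxBytesPerFile = maxBytesPerFile;
    }

    public int FileCount { get; private set; }
    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => _totalWritten;
    public override long Position
    {
        get => _totalWritten;
        set => throw new NotSupportedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            if (_current == null || _writtenToCurrent == _maxBytesPerFile) OpenNextFile();
            // write only as much as still fits into the current file
            int chunk = (int)Math.Min(count, _maxBytesPerFile - _writtenToCurrent);
            _current.Write(buffer, offset, chunk);
            offset += chunk;
            count -= chunk;
            _writtenToCurrent += chunk;
            _totalWritten += chunk;
        }
    }

    private void OpenNextFile()
    {
        _current?.Dispose();
        FileCount++;
        string path = _basePath + ".part" + FileCount.ToString("D3");
        _current = new FileStream(path, FileMode.Create, FileAccess.Write);
        _writtenToCurrent = 0;
    }

    public override void Flush() => _current?.Flush();
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _current?.Dispose();
        base.Dispose(disposing);
    }
}

static class PartitionDemo
{
    static void Main()
    {
        // Any writer that targets a Stream can be used; XmlWriter is one example.
        using (var output = new PartitionedFileStream("sitemap.xml", 1024))
        using (var xml = XmlWriter.Create(output))
        {
            xml.WriteStartElement("urlset");
            for (int i = 0; i < 100; i++) xml.WriteElementString("url", "http://example.com/" + i);
            xml.WriteEndElement();
        }
    }
}
```

Because the split happens below the XML layer, the writer never needs to know about the size limit.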
In the general case, closing tags will not work. Consider an XML format that must contain one element A followed by one element B. If you closed the tags after writing element A, then you do not have a valid document - you need to have written element B.
However, in the specific case of a simple site map file, it may be possible to just close the tags.
You can ask the XmlTextWriter for its BaseStream, and check its Position.
As the others pointed out, you may need to reserve some headroom to properly close the XML.
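A minimal sketch of that check, assuming a flat list of string items. The element names, the WriteUpTo helper and its reserveBytes parameter are all invented for the example; reserveBytes has to cover the largest single item plus the closing tag. The writer buffers its output, so it is flushed before Position is read.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class SizeLimitedXml
{
    // Writes items until the next one might push the stream past maxBytes.
    // Returns the number of items that were written.
    public static int WriteUpTo(Stream output, string[] values, long maxBytes, int reserveBytes)
    {
        int written = 0;
        var writer = new XmlTextWriter(output, new UTF8Encoding(false));
        writer.WriteStartDocument();
        writer.WriteStartElement("items");
        foreach (string value in values)
        {
            writer.Flush(); // push buffered text to the stream so Position is current
            if (writer.BaseStream.Position + reserveBytes > maxBytes) break;
            writer.WriteElementString("item", value);
            written++;
        }
        writer.WriteEndElement(); // the headroom pays for this closing tag
        writer.WriteEndDocument();
        writer.Flush();
        return written;
    }

    static void Main()
    {
        var values = new string[1000];
        for (int i = 0; i < values.Length; i++) values[i] = "value " + i;
        using (var ms = new MemoryStream())
        {
            int count = WriteUpTo(ms, values, 4096, 64);
            Console.WriteLine(count + " items, " + ms.Length + " bytes");
        }
    }
}
```

Items that did not fit can then go into the next file, which keeps every file well-formed on its own.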