I am writing a little tool for myself to merge PDF files using the PDFsharp library. I am using the latest pre-release version (1.5) of PDFsharp.
I came across a problem where documents loaded into memory are not released when they go out of scope. I tracked the memory leak down to the following part of the code:
using (var mergedDocument = new PdfDocument())
{
for (var i = 0; i < SelectedDocuments.Count; i++)
{
using (var document = PdfReader.Open(SelectedDocuments[i].FilePath, PdfDocumentOpenMode.Import))
{
for (var j = 0; j < document.PageCount; j++)
{
mergedDocument.AddPage(document.Pages[j]);
}
}
}
mergedDocument.Save(savePath);
}
As an example, I have 10 PDF documents which total 178 MB. The merged document that is created is also around 178 MB. When the above code finishes executing, memory usage holds at 356 MB. When I merge more documents the memory usage keeps going up and eventually causes a crash.
I tried removing the using statements and calling Dispose() when I wanted the document to be released from memory, but that does not work either.
Any help would be appreciated. Thank you.
Edit:
To be more precise:
var parentDirectory = Directory.GetParent(SelectedDocuments[0].FilePath);
var savePath = parentDirectory + "\\MergedDocument.pdf";
using (var mergedDocument = new PdfDocument())
{
for (var i = 0; i < SelectedDocuments.Count; i++)
{
using (var document = PdfReader.Open(SelectedDocuments[i].FilePath, PdfDocumentOpenMode.Import))
{
for (var j = 0; j < document.PageCount; j++)
{
mergedDocument.AddPage(document.Pages[j]);
}
}
}
mergedDocument.Save(savePath);
}
SelectedDocuments is a list which holds a bunch of file paths to the selected PDF files.
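For reference, each entry only needs to expose a file path; a simplified stand-in for the class I actually use would look roughly like this:

// Simplified stand-in for the items in SelectedDocuments; only FilePath matters for the snippets above.
// Requires System.Collections.Generic, System.IO and System.Linq.
public class SelectedDocument
{
    public string FilePath { get; set; }
}

// Populated from a folder of PDFs, for example:
List<SelectedDocument> SelectedDocuments = Directory.GetFiles(@"C:\Temp\Pdfs", "*.pdf")
    .Select(path => new SelectedDocument { FilePath = path })
    .ToList();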
I ended up using iTextSharp instead with the following code to avoid memory issues:
var parentDirectory = Directory.GetParent(SelectedDocuments[0].FilePath);
var savePath = parentDirectory + "\\MergedDocument.pdf";
using (var fs = new FileStream(savePath, FileMode.Create))
{
using (var document = new Document())
{
using (var pdfCopy = new PdfCopy(document, fs))
{
document.Open();
for (var i = 0; i < SelectedDocuments.Count; i++)
{
using (var pdfReader = new PdfReader(SelectedDocuments[i].FilePath))
{
for (var page = 0; page < pdfReader.NumberOfPages;)
{
pdfCopy.AddPage(pdfCopy.GetImportedPage(pdfReader, ++page));
}
}
}
}
}
}
Related
This is the code that I've written so far.
It doesn't do the job; it just rewrites every line to the same file over and over again.
*RecordCntPerFile = 10K
*FileNumberName = 1 (file number one)
*Full File name should be something like this: 1_asci_split
string FileFullPath = DestinationFolder + "\\" + FileNumberName + FileNamePart + FileExtension;
using (System.IO.StreamReader sr = new System.IO.StreamReader(SourceFolder + "\\" + SourceFileName))
{
for (int i = 0; i <= (RecordCntPerFile - 1); i++)
{
using (StreamWriter sw = new StreamWriter(FileFullPath))
{
{ sw.Write(sr.Read() + "\n"); }
}
}
FileNumberName++;
}
Dts.TaskResult = (int)ScriptResults.Success;
}
If I understood correctly, you want to split a big file into smaller files with a maximum of 10k lines each. I see two problems in your code:
You never change the FileFullPath variable, so you always rewrite the same file.
Your inner loop calls sr.Read(), which reads a single character and returns it as an int, so you never copy whole lines to the target file.
I rewrote your code to match the behaviour described above. You just have to modify the strings (paths and file names).
int maxRecordsPerFile = 10000;
int currentFile = 1;
using (StreamReader sr = new StreamReader("source.txt"))
{
int currentLineCount = 0;
List<string> content = new List<string>();
while (!sr.EndOfStream)
{
content.Add(sr.ReadLine());
if (++currentLineCount == maxRecordsPerFile || sr.EndOfStream)
{
using (StreamWriter sw = new StreamWriter(string.Format("file{0}.txt", currentFile)))
{
foreach (var line in content)
sw.WriteLine(line);
}
content = new List<string>();
currentFile++;
currentLineCount = 0;
}
}
}
Of course you can do better than that: you don't need to build up that list of lines and then write them out in a foreach loop. I just made this quick example to give you the idea; improving the performance is up to you.
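For example, a streaming version of that idea could look roughly like this (an untested sketch that reuses the same file names as above): write each line straight to the current output file and switch to a new StreamWriter every maxRecordsPerFile lines instead of buffering them in a list.

// Sketch of the streaming variant: no intermediate List<string>, the writer is rotated every maxRecordsPerFile lines.
int maxRecordsPerFile = 10000;
int currentFile = 1;
int currentLineCount = 0;
using (StreamReader sr = new StreamReader("source.txt"))
{
    StreamWriter sw = null;
    try
    {
        while (!sr.EndOfStream)
        {
            if (sw == null)
                sw = new StreamWriter(string.Format("file{0}.txt", currentFile));
            sw.WriteLine(sr.ReadLine());
            if (++currentLineCount == maxRecordsPerFile)
            {
                sw.Dispose();      // close the finished chunk
                sw = null;
                currentFile++;
                currentLineCount = 0;
            }
        }
    }
    finally
    {
        if (sw != null)
            sw.Dispose();          // close the last, possibly partial, chunk
    }
}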
I have to merge a large (but not known in advance) number of PDFs into a single PDF. For this I'm using the PDFsharp code shown here.
// Get some file names
string[] files = filesToPrint.ToArray();
// Open the output document
PdfDocument outputDocument = new PdfDocument();
PdfPage newPage;
int nProcessedFile = 0;
int nMemoryFile = 5;
int nStepConverted = 0;
String sNameLastCombineFile = "";
// Iterate files
foreach (string file in files)
{
// Open the document to import pages from it.
PdfDocument inputDocument = PdfReader.Open(file, PdfDocumentOpenMode.Import);
// Iterate pages
int count = inputDocument.PageCount;
for (int idx = 0; idx < count; idx++)
{
// Get the page from the external document...
PdfPage page = inputDocument.Pages[idx];
// ...and add it to the output document.
outputDocument.AddPage(page);
}
nProcessedFile++;
if (nProcessedFile >= nMemoryFile)
{
//nProcessedFile = 0;
//nStepConverted++;
//sNameLastCombineFile = "ConcatenatedDocument" + nStepConverted.ToString() + " _tempfile.pdf";
//outputDocument.Save(sNameLastCombineFile);
//outputDocument.Close();
}
}
// Save the document...
const string filename = "ConcatenatedDocument1_tempfile.pdf";
outputDocument.Save(filename);
// ...and start a viewer.
Process.Start(filename);
For small numbers of files the code works, but at some point it generates an out-of-memory exception.
Is there a solution?
P.S.
I was thinking of saving the file in steps and then adding the remaining ones, so as to free memory, but I cannot find a way to do it.
UPDATE 1:
if (nProcessedFile >= nMemoryFile)
{
nProcessedFile = 0;
//nStepConverted++;
sNameLastCombineFile = "ConcatenatedDocument" + nStepConverted.ToString() + " _tempfile.pdf";
outputDocument.Save(sNameLastCombineFile);
outputDocument.Close();
outputDocument = PdfReader.Open(sNameLastCombineFile,PdfDocumentOpenMode.Modify);
}
UPDATE 2: version 1.32
Complete example
Error on line:
PdfDocument inputDocument = PdfReader.Open(file, PdfDocumentOpenMode.Import);
Text error:
Cannot handle iref streams. The current implementation of PDFsharp cannot handle this PDF feature introduced with Acrobat 6.
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
List<String> filesToPrint = new List<string>();
filesToPrint = Directory.GetFiles(@"D:\Downloads\RACCOLTA\FILE PDF", "*.pdf").ToList();
// Get some file names
string[] files = filesToPrint.ToArray();
// Open the output document
PdfDocument outputDocument = new PdfDocument();
PdfPage newPage;
int nProcessedFile = 0;
int nMemoryFile = 5;
int nStepConverted = 0;
String sNameLastCombineFile = "";
try
{
// Iterate files
foreach (string file in files)
{
// Open the document to import pages from it.
PdfDocument inputDocument = PdfReader.Open(file, PdfDocumentOpenMode.Import);
// Iterate pages
int count = inputDocument.PageCount;
for (int idx = 0; idx < count; idx++)
{
// Get the page from the external document...
PdfPage page = inputDocument.Pages[idx];
// ...and add it to the output document.
outputDocument.AddPage(page);
}
nProcessedFile++;
if (nProcessedFile >= nMemoryFile)
{
nProcessedFile = 0;
//nStepConverted++;
sNameLastCombineFile = "ConcatenatedDocument" + nStepConverted.ToString() + " _tempfile.pdf";
outputDocument.Save(sNameLastCombineFile);
outputDocument.Close();
inputDocument = PdfReader.Open(sNameLastCombineFile , PdfDocumentOpenMode.Modify);
}
}
// Save the document...
const string filename = "ConcatenatedDocument1_tempfile.pdf";
outputDocument.Save(filename);
// ...and start a viewer.
Process.Start(filename);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
Console.ReadKey();
}
}
}
}
UPDATE 3
Code that generates the out-of-memory exception:
int count = inputDocument.PageCount;
for (int idx = 0; idx < count; idx++)
{
// Get the page from the external document...
newPage = inputDocument.Pages[idx];
// ...and add it to the output document.
outputDocument.AddPage(newPage);
newPage.Close();
}
I cannot tell exactly which row generates the exception.
I had a similar issue; saving, closing and reopening the PdfDocument did not really help.
I am adding a lot (100+) of large (up to 5 MB) images (TIFF, JPG, etc.) to a PDF document, where every image has its own page. It crashed around image #50. After the save-close-reopen it did finish the whole document, but it was still getting close to max memory, around 3 GB. With some more images it would still crash.
After more refining, I implemented a using for the XGraphics object; it was a little better again, but not much.
The big step forward was disposing of the XImage within the loop! After that the application never used more than 100-200 KB. I removed the save-close-reopen for the PdfDocument and it was no problem.
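In code, the pattern that helped looks roughly like this (a simplified sketch; imageFiles is a placeholder for however you collect the image paths):

// Simplified sketch: one image per page, disposing the XImage (and XGraphics) inside the loop.
var document = new PdfDocument();
foreach (string imageFile in imageFiles)   // imageFiles: placeholder list of tiff/jpg paths
{
    PdfPage page = document.AddPage();
    using (XImage image = XImage.FromFile(imageFile))        // disposing this per iteration keeps memory flat
    {
        page.Width = image.PointWidth;
        page.Height = image.PointHeight;
        using (XGraphics gfx = XGraphics.FromPdfPage(page))  // also disposed every iteration
        {
            gfx.DrawImage(image, 0, 0, page.Width, page.Height);
        }
    }
}
document.Save("images.pdf");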
After saving and closing outputDocument (the code is commented out in your snippet), you have to open outputDocument again, using PdfDocumentOpenMode.Modify.
It could help to add using(...) for the inputDocument.
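Applied to your loop, those two suggestions would look roughly like this (a sketch only, keeping your variable names):

// Rough sketch combining both suggestions; variable names as in your snippet.
foreach (string file in files)
{
    // using(...) releases each input document as soon as its pages have been copied.
    using (PdfDocument inputDocument = PdfReader.Open(file, PdfDocumentOpenMode.Import))
    {
        for (int idx = 0; idx < inputDocument.PageCount; idx++)
            outputDocument.AddPage(inputDocument.Pages[idx]);
    }
    nProcessedFile++;
    if (nProcessedFile >= nMemoryFile)
    {
        nProcessedFile = 0;
        nStepConverted++;
        sNameLastCombineFile = "ConcatenatedDocument" + nStepConverted + "_tempfile.pdf";
        outputDocument.Save(sNameLastCombineFile);
        outputDocument.Close();
        // Reopen the partially combined file so the remaining pages are appended to it.
        outputDocument = PdfReader.Open(sNameLastCombineFile, PdfDocumentOpenMode.Modify);
    }
}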
If your code is running as a 32-bit process, then switching to 64 bit will allow your process to use more than 2 GB of RAM (assuming your computer has more than 2 GB RAM).
Update: The message "Cannot handle iref streams" means you have to use PDFsharp 1.50 Prerelease, available on NuGet.
When I try to save each page as GIF using ABCpdf, only the first page is saved.
For example: I have a PDF that has 3 pages. I use ABCpdf to render each page to a stream, which is saved to disk. When I open the files in my destination folder, all 3 files show the first page content.
Here's my code:
using (Doc theDoc = new Doc())
{
XReadOptions options = new XReadOptions { ReadModule = ReadModuleType.Pdf };
theDoc.Read(inputbytearray, options);
using (MemoryStream ms = new MemoryStream())
{
theDoc.Rendering.DotsPerInch = 150;
int n = theDoc.PageCount;
for (int i = 1; i <= n; i++)
{
Guid FileName = Guid.NewGuid();
theDoc.Rect.String = theDoc.CropBox.String;
theDoc.Rendering.SaveAppend = (i != 1);
theDoc.Rendering.SaveCompression = XRendering.Compression.G4;
theDoc.PageNumber = i;
theDoc.Rendering.Save(string.Format("{0}.gif", FileName), ms);
using (var streamupload = new MemoryStream(ms.GetBuffer(), writable: false))
{
_BlobStorageService.UploadfromStream(FileName.ToString(), streamupload, STR_Gif, STR_Imagegif);
}
}
// theDoc.Clear();
}
}
The Rendering.SaveAppend property is only applicable when saving TIFF images. For GIFs you would need to save a separate image for each PDF page.
private void button1_Click(object sender, System.EventArgs e)
{
string theDir = Directory.GetParent(Directory.GetCurrentDirectory()).Parent.FullName + @"\files\";
// Create test PDF
using (Doc doc = new Doc())
{
for (int i = 1; i <= 3; i++)
{
doc.Page = doc.AddPage();
doc.AddHtml("<font size=24>PAGE " + i.ToString());
}
doc.Save(Path.Combine(theDir, "test.pdf"));
}
// Save PDF pages to GIF streams
using (Doc doc = new Doc())
{
doc.Read(Path.Combine(theDir, "test.pdf"));
for (int i = 1; i <= doc.PageCount; i++)
{
doc.PageNumber = i;
doc.Rect.String = doc.CropBox.String;
using (MemoryStream ms = new MemoryStream())
{
doc.Rendering.Save("dummy.gif", ms);
using (FileStream fs = File.Create(Path.Combine(theDir, "p" + i.ToString() + ".gif")))
{
ms.Seek(0, SeekOrigin.Begin);
ms.CopyTo(fs);
}
}
}
}
}
I am trying to download a list of files, but I'm not really sure how to proceed.
As the topic says, I am using DropNet, and this is the procedure I am trying to download the files with:
Get a list of all files in my application's dedicated folder and store them in a List as strings.
Then I try the following:
foreach (string file in files)
{
_client.GetFileAsync("/" +file,
(response) =>
{
using(FileStream fs = new FileStream(path +file +".gttmp", FileMode.Create))
{
for(int i = 0; i < response.RawBytes.Length; i++)
{
fs.WriteByte(response.RawBytes[i]);
}
fs.Seek(0, SeekOrigin.Begin);
for(int i = 0; i < response.RawBytes.Length; i++)
{
if(response.RawBytes[i] != fs.ReadByte())
{
MessageBox.Show("Error writing data for " +file);
return;
}
}
}
},
(error) =>
{
MessageBox.Show("Could not download file " +file, "Error!");
});
}
Unfortunately it doesn't seem to work at all.
Is anyone using DropNet who could suggest something that would work?
I ended up using the synchronous method instead:
foreach (string file in files)
{
var fileBytes = _client.GetFile("/" + file);
using (FileStream fs = new FileStream(path +file + ".gttmp", FileMode.Create))
{
for (int i = 0; i < fileBytes.Length; i++)
{
fs.WriteByte(fileBytes[i]);
}
fs.Seek(0, SeekOrigin.Begin);
for (int i = 0; i < fileBytes.Length; i++)
{
if (fileBytes[i] != fs.ReadByte())
{
MessageBox.Show("Error writing data for " + file);
break;
}
}
}
}
Your code to download the file asynchronously works fine; I tried it in the following way and it proceeds without error.
client.GetFileAsync("/novemberrain.mp3",
(response) =>
{
using (FileStream fs = new FileStream(@"D:\novemberrain.mp3", FileMode.Create))
{
for (int i = 0; i < response.RawBytes.Length; i++)
{
fs.WriteByte(response.RawBytes[i]);
}
}
MessageBox.Show("file downloaded");
},
(error) =>
{
MessageBox.Show("error downloading");
});
I have a PDF document that has form fields that I'm filling out programmatically with C#. Depending on three conditions, I need to trim (delete) some of the pages from that document.
Is that possible to do?
for condition 1: I need to keep pages 1-4 but delete pages 5 and 6
for condition 2: I need to keep pages 1-4 but delete 5 and keep 6
for condition 3: I need to keep pages 1-5 but delete 6
Use PdfReader.SelectPages() combined with PdfStamper. The code below uses iTextSharp 5.5.1.
public void SelectPages(string inputPdf, string pageSelection, string outputPdf)
{
using (PdfReader reader = new PdfReader(inputPdf))
{
reader.SelectPages(pageSelection);
using (PdfStamper stamper = new PdfStamper(reader, File.Create(outputPdf)))
{
stamper.Close();
}
}
}
Then you call this method with the correct page selection for each condition.
Condition 1:
SelectPages(inputPdf, "1-4", outputPdf);
Condition 2:
SelectPages(inputPdf, "1-4,6", outputPdf);
or
SelectPages(inputPdf, "1-6,!5", outputPdf);
Condition 3:
SelectPages(inputPdf, "1-5", outputPdf);
Here's the comment from the iTextSharp source code on what makes up a page selection. This is in the SequenceList class which is used to process a page selection:
/**
* This class expands a string into a list of numbers. The main use is to select a
* range of pages.
* <p>
* The general syntax is:<br>
* [!][o][odd][e][even]start-end
* <p>
* You can have multiple ranges separated by commas ','. The '!' modifier removes the
* range from what is already selected. The range changes are incremental, that is,
* numbers are added or deleted as the range appears. The start or the end, but not both, can be omitted.
*/
Instead of deleting pages in a document, what you actually do is create a new document and only import the pages that you want to keep. Below is a full working WinForms app that does that (targeting iTextSharp 5.1.1.0). The last parameter to the function removePagesFromPdf is an array of pages to keep.
The code below works off of physical files but would be very easy to convert to something based on streams so that you don't have to write to disk if you don't want to.
using System;
using System.ComponentModel;
using System.IO;
using System.Linq;
using System.Windows.Forms;
using iTextSharp.text.pdf;
using iTextSharp.text;
namespace Full_Profile1
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
//The files that we are working with
string sourceFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
string sourceFile = Path.Combine(sourceFolder, "Test.pdf");
string destFile = Path.Combine(sourceFolder, "TestOutput.pdf");
//Remove all pages except 1,2,3,4 and 6
removePagesFromPdf(sourceFile, destFile, 1, 2, 3, 4, 6);
this.Close();
}
public void removePagesFromPdf(String sourceFile, String destinationFile, params int[] pagesToKeep)
{
//Used to pull individual pages from our source
PdfReader r = new PdfReader(sourceFile);
//Create our destination file
using (FileStream fs = new FileStream(destinationFile, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document doc = new Document())
{
using (PdfWriter w = PdfWriter.GetInstance(doc, fs))
{
//Open the desitination for writing
doc.Open();
//Loop through each page that we want to keep
foreach (int page in pagesToKeep)
{
//Add a new blank page to destination document
doc.NewPage();
//Extract the given page from our reader and add it directly to the destination PDF
w.DirectContent.AddTemplate(w.GetImportedPage(r, page), 0, 0);
}
//Close our document
doc.Close();
}
}
}
}
}
}
Here is the code I use to copy all but the last page of an existing PDF. Everything is in memory streams. The variable pdfByteArray is a byte[] of the original pdf obtained using ms.ToArray(). pdfByteArray is overwritten with the new PDF.
PdfReader originalPDFReader = new PdfReader(pdfByteArray);
using (MemoryStream msCopy = new MemoryStream())
{
using (Document docCopy = new Document())
{
using (PdfCopy copy = new PdfCopy(docCopy, msCopy))
{
docCopy.Open();
for (int pageNum = 1; pageNum <= originalPDFReader.NumberOfPages - 1; pageNum ++)
{
copy.AddPage(copy.GetImportedPage(originalPDFReader, pageNum ));
}
docCopy.Close();
}
}
pdfByteArray = msCopy.ToArray();
}
I know it's an old post; I'm simply extending the @chris-haas solution to the next level:
delete the selected pages and then save the result into a separate PDF file.
//ms is MemoryStream and fs is FileStream
ms.CopyTo(fs);
Save the stream to a separate PDF file. It works 100% without any errors.
pageRange="5"
pageRange="2,15-20"
pageRange="1-5,15-20"
You can pass pageRange values like the samples given above.
private void DeletePagesNew(string pageRange, string SourcePdfPath, string OutputPdfPath, string Password = "")
{
try
{
var pagesToDelete = new List<int>();
if (pageRange.IndexOf(",") != -1)
{
var tmpHold = pageRange.Split(',');
foreach (string nonconseq in tmpHold)
{
if (nonconseq.IndexOf("-") != -1)
{
var rangeHold = nonconseq.Split('-');
for (int i = Convert.ToInt32(rangeHold[0]), loopTo = Convert.ToInt32(rangeHold[1]); i <= loopTo; i++)
pagesToDelete.Add(i);
}
else
{
pagesToDelete.Add(Convert.ToInt32(nonconseq));
}
}
}
else if (pageRange.IndexOf("-") != -1)
{
var rangeHold = pageRange.Split('-');
for (int i = Convert.ToInt32(rangeHold[0]), loopTo1 = Convert.ToInt32(rangeHold[1]); i <= loopTo1; i++)
pagesToDelete.Add(i);
}
else
{
pagesToDelete.Add(Convert.ToInt32(pageRange));
}
var Reader = new PdfReader(SourcePdfPath);
int[] pagesToKeep;
pagesToKeep = Enumerable.Range(1, Reader.NumberOfPages).ToArray();
using (var ms = new MemoryStream())
{
using (var fs = new FileStream(OutputPdfPath, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (var doc = new Document())
{
using (PdfWriter w = PdfWriter.GetInstance(doc, fs))
{
doc.Open();
foreach (int p in pagesToKeep)
{
if (pagesToDelete.FindIndex(s => s == p) != -1)
{
continue;
}
// doc.NewPage()
// w.DirectContent.AddTemplate(w.GetImportedPage(Reader, p), 0, 0)
//
doc.SetPageSize(Reader.GetPageSize(p));
doc.NewPage();
PdfContentByte cb = w.DirectContent;
PdfImportedPage pageImport = w.GetImportedPage(Reader, p);
int rot = Reader.GetPageRotation(p);
if (rot == 90 || rot == 270)
{
cb.AddTemplate(pageImport, 0, -1.0f, 1.0f, 0, 0, Reader.GetPageSizeWithRotation(p).Height);
}
else
{
cb.AddTemplate(pageImport, 1.0f, 0, 0, 1.0f, 0, 0);
}
cb = default;
pageImport = default;
rot = default;
}
ms.CopyTo(fs);
fs.Flush();
doc.Close();
}
}
}
}
pagesToDelete = null;
Reader.Close();
Reader = default;
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}