I have a SQL Server DB. In there are many, many rows. Each row has a column that contains a stored PDF.
The DB is a gig in size, so we can expect roughly half of that size to be due to the PDFs.
Now I have a requirement to join all those PDFs ... into 1 PDF. Don't ask why.
Can you suggest the best way forward and which component would be best suited for this job? There are many answers available:
How can I join two PDF's using iTextSharp?
Merge memorystreams to one itext document
How to merge multiple pdf files (generated in run time)?
as to how to join two (or more) PDFs. But what I'm asking about is performance: we are literally dealing with around 50 000 PDFs that need to be merged into 1 almighty PDF.
[Edit Solution] Brought the time to merge 1000 PDFs down from 4m30s to 21s:
public void MergePDFs(string targetPDF, string sourceDir)
{
    using (FileStream stream = new FileStream(targetPDF, FileMode.Create))
    {
        var files = Directory.GetFiles(sourceDir);
        Document pdfDoc = new Document(PageSize.A4);

        // PdfCopy writes pages to the output stream as they are added instead of
        // building the whole merged document in memory.
        PdfCopy pdf = new PdfCopy(pdfDoc, stream);
        pdfDoc.Open();

        Console.WriteLine("Merging files count: " + files.Length);
        int i = 1;
        var watch = System.Diagnostics.Stopwatch.StartNew();

        foreach (string file in files)
        {
            Console.WriteLine(i + ". Adding: " + file);

            // Open each source, copy it, and close the reader immediately so that
            // only one source PDF is held in memory at a time.
            PdfReader reader = new PdfReader(file);
            pdf.AddDocument(reader);
            reader.Close();
            i++;
        }

        pdfDoc.Close();

        watch.Stop();
        var elapsedMs = watch.ElapsedMilliseconds;
        MessageBox.Show(elapsedMs.ToString());
    }
}
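Since the source PDFs actually live in a database column rather than in a folder, here is a rough sketch of feeding PdfCopy straight from a SqlDataReader so nothing needs to be staged on disk first. The table and column names (dbo.Documents, Pdf) are hypothetical; adjust them to your schema.

using System.Data;
using System.Data.SqlClient;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class DbPdfMerger
{
    // Hypothetical schema: dbo.Documents(Id INT, Pdf VARBINARY(MAX)).
    public static void MergeFromDatabase(string connectionString, string targetPDF)
    {
        using (var output = new FileStream(targetPDF, FileMode.Create))
        using (var connection = new SqlConnection(connectionString))
        {
            Document pdfDoc = new Document(PageSize.A4);
            PdfCopy pdf = new PdfCopy(pdfDoc, output);
            pdfDoc.Open();

            connection.Open();
            var select = new SqlCommand("SELECT Pdf FROM dbo.Documents ORDER BY Id", connection);

            // SequentialAccess streams each BLOB instead of buffering whole rows.
            using (SqlDataReader rows = select.ExecuteReader(CommandBehavior.SequentialAccess))
            {
                while (rows.Read())
                {
                    // Copy one PDF at a time into memory, add it, then let it go.
                    using (Stream blob = rows.GetStream(0))
                    using (var buffer = new MemoryStream())
                    {
                        blob.CopyTo(buffer);
                        PdfReader reader = new PdfReader(buffer.ToArray());
                        pdf.AddDocument(reader);
                        reader.Close();
                    }
                }
            }

            pdfDoc.Close();
        }
    }
}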
I just did a C#/WinForms project with PDFsharp, merging images into PDFs, and it worked phenomenally with a traditional folder structure. I imagine it would work similarly with database-stored PDFs as long as you can pull them into a memory stream first and then merge them.
Some suggestions:
1) I recommend doing this in a multi-threaded environment so you can work on multiple PDFs at a time.
2) Open only what you need and close it as soon as the operation is complete. Say you have three documents that need to be merged into one: create a blank PDF, open the first into a memory stream, open the blank, append the first to the blank, close the first, save the blank, close the blank. Repeat for the second and third. This way you control how much memory you are using at any one point in time; with this approach I was able to append millions of images while keeping memory usage under control (see the sketch after this list).
3) Ensure you are using using statements when working with these objects. This helps with memory cleanup and eliminates the need to call the garbage collector manually, which is frowned upon.
4) Separate your business logic (the work) from your UI as best you can, so you can cancel the operation at any point or view its current status as it progresses.
5) Log everything that is done, so that you can go back and correct the one-offs for any PDFs that didn't make it through the first pass.
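For point 2, here is a rough PDFsharp sketch of the one-document-at-a-time idea (the file names and the *.pdf filter are illustrative, and it assumes the import API and disposable documents of recent PDFsharp versions); the stricter variant from point 2 would also save and reopen the output between appends.

using System;
using System.IO;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static class PdfFolderMerger
{
    // Merges every PDF found in sourceDir into a single output file, importing
    // one source document at a time so that only the growing output (plus a
    // single source) is ever held in memory.
    public static void MergeOneAtATime(string sourceDir, string targetPdf)
    {
        using (PdfDocument output = new PdfDocument())
        {
            foreach (string file in Directory.GetFiles(sourceDir, "*.pdf"))
            {
                // Import mode opens the source read-only so its pages can be copied.
                using (PdfDocument input = PdfReader.Open(file, PdfDocumentOpenMode.Import))
                {
                    foreach (PdfPage page in input.Pages)
                        output.AddPage(page);
                } // the source is released here, before the next one is opened

                Console.WriteLine("Appended " + file);
            }

            output.Save(targetPdf);
        }
    }
}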
I have a routine that reads XPS documents and chops pages off into separate documents. Originally it read one document, decided how to chop it, closed it, and wrote out the new files.
Features were added; this was causing headaches with cleaning up old files before running the routine, so now I save all the chopped pieces and write them out at the end.
ChoppedXPS is a dictionary; the key is the filename and the value is the FixedDocument prepared from the original:
foreach (String OneReport in ChoppedXPS.Keys)
{
    File.Delete(OneReport);
    using (XpsDocument TargetFile = new XpsDocument(OneReport, FileAccess.ReadWrite))
    {
        XpsDocumentWriter Writer = XpsDocument.CreateXpsDocumentWriter(TargetFile);
        Writer.Write(ChoppedXPS[OneReport]);
        Logger($"{OneReport} written to disk", 2);
    }

    Application.DoEvents();
}
If the FixedDocument being written out here contains graphics, the source file is opened by the Writer.Write line and left open until the program is closed.
The XpsDocumentWriter does not seem to implement anything that can be used to clean it up.
(Yeah, that Application.DoEvents is ugly. This is an in-house program used by two people, so it's not worth the hassle of making it run in the background, and without it a big enough task can cause Windows to decide the app is non-responsive and kill it.)
.NET 4.5, using some C# 8.0 features.
I found a workaround for this problem. I'm not going to post the whole thing, as I had to change all the data handling, but here is the heart of it:
using (XpsDocument Source = new XpsDocument(SourceFile, FileAccess.Read))
{
    [the using loop from my question]
}
I'm still hoping for an explanation and for something more appropriate than this approach.
Yes, this produces a warning that Source is unused, but the compiler isn't eliminating it, so it does work.
Situation I need to solve:
My client has some extremely large .xlsx files that resemble a database table (each row is a record, cols are fields)
I need to help them process those files (search, filter, etc).
By large I mean the smallest of them has 1 million records.
What I have tried:
SheetJS and NPOI: both libraries simply fail with a "file too large" error.
EPPlus: it can read files up to a few hundred thousand records, but when faced with the actual file it just gives me a System.OverflowException; my guess is that it's basically out of memory, since a 200 MB xlsx file already took 4 GB of memory to read.
I didn't try Microsoft OleDB, but I'd rather avoid it, since I don't want to purchase Microsoft Office just for a job.
Due to confidentiality I cannot share the actual file, but you can easily create a similar structure with 60 cols (first name, last name, dob, etc), and about 1M records.
The question would be solved as soon as you can read an .xlsx file matching those criteria, remove half of the records, then write the rest to another file without running into memory issues.
Time is not too much of an issue. User is willing to wait an hour or 2 for result if needed.
Memory seems to be the issue currently. This is a personal request, and the client's machine is a laptop capped at 8 GB of RAM.
CSV is not an option here. My client has .xlsx input and needs .xlsx output.
The language choice is preferably JS, C# or Python, since I already know how to create executables with them (well, we can't tell an accountant to learn the terminal, can we?).
It would be great if there were a way to read small chunks of data from the file row by row, but the solutions I have found only read the entire file at once.
For reading the Excel file I would recommend ExcelDataReader. It does very well with large files; I have personally tried 500k-1M records:
using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
{
    using (var reader = ExcelReaderFactory.CreateReader(stream))
    {
        while (reader.Read())
        {
            for (var i = 0; i < reader.FieldCount; i++)
            {
                var value = reader.GetValue(i)?.ToString();
            }
        }
    }
}
Writing the data back in an equally efficient way is trickier. I ended up creating my own SwiftExcel library, which is extremely fast and efficient (there is a performance chart comparing it to other NuGet libraries, including EPPlus) because it does not use any XML serialization and writes data directly to the file:
using (var ew = new ExcelWriter("C:\\temp\\test.xlsx"))
{
    for (var row = 1; row <= 100; row++)
    {
        for (var col = 1; col <= 10; col++)
        {
            ew.Write($"row:{row}-col:{col}", col, row);
        }
    }
}
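Putting the two together for the task in the question (read, drop the records you don't need, write the rest), here is a rough sketch; the filter predicate and file paths are placeholders.

using System.IO;
using ExcelDataReader;
using SwiftExcel;

public static class XlsxFilter
{
    // Streams rows from the input one at a time and writes the rows that pass
    // the (placeholder) filter straight to the output, so only a single row's
    // worth of data is held in memory at any moment.
    public static void FilterRows(string inputPath, string outputPath)
    {
        using (var stream = File.Open(inputPath, FileMode.Open, FileAccess.Read))
        using (var reader = ExcelReaderFactory.CreateReader(stream))
        using (var writer = new ExcelWriter(outputPath))
        {
            var outRow = 1;
            while (reader.Read())
            {
                // Read the current row's cells as strings.
                var cells = new string[reader.FieldCount];
                for (var i = 0; i < reader.FieldCount; i++)
                    cells[i] = reader.GetValue(i)?.ToString() ?? "";

                // Placeholder filter: keep the row only if the first cell is non-empty.
                if (string.IsNullOrEmpty(cells[0]))
                    continue;

                // SwiftExcel's Write takes (value, col, row), 1-based, as in the example above.
                for (var col = 1; col <= cells.Length; col++)
                    writer.Write(cells[col - 1], col, outRow);

                outRow++;
            }
        }
    }
}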
I have a large Word docx file (more than 100 MB) that contains a table, and there is a requirement to add additional data to this table.
I am using the following approach:
using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filepath, true))
{
    var table = wordDocument.MainDocumentPart.Document.Body.Elements<Table>().First();
    foreach (var row in data)
    {
        var tRow = new TableRow();
        foreach (var cell in row)
        {
            var hCell = new TableCell();
            var cPar = new Paragraph(new Run(new Text(cell)));
            hCell.Append(cPar);
            tRow.Append(hCell);
        }
        table.Append(tRow);
    }
}
And it seems that the whole document gets loaded into memory. Is there any way to write to the file without loading the whole DOM structure, using, for example, a SAX approach?
I've tried to do the same many times, but so far my conclusion is that, no, you cannot.
I've tried many approaches, even using 3rd party tools like Aspose, but so far they've all required loading the full document into memory.
Which makes sense, considering you have to find the entry point, and how can you do that without evaluating the content, which requires loading the content?
So far, the method with the least extra overhead (yet using the most memory) is to use VSTO and do it through a Word Add-In, while the document is open anyway.
It gives the least extra overhead because the document is open anyway (the user has opened it in Word), but this approach won't be useful if you need to do this without user action, or if it has to be done server-side without launching a full Word application.
As of now, we are generating PDFs programmatically using Crystal Reports and saving them to the database. Each PDF document has a barcode image in it, and each file is 120-150 KB in size.
Everything is running fine, but lately we are facing a problem with huge growth in database size and storage requirements. This is due to 100-1000 records being generated each day.
Is there any way to compress the PDF files before storing them? Are there any APIs/tools available that can do this without causing issues with the barcode? Can we gain much reduction in size after compression?
Or would any alternative way of storing the data be better?
Any suggestions on this would be highly appreciated.
Thanks,
Sveerap
Unfortunately, you won't gain much by compressing a PDF as it is already compressed.
Many compressed PDF files can be compressed further.
Size of a PDF file can usually be decreased by:
removing unused objects (if any)
removing extra whitespace characters from the file (not from the visual content)
using object streams (a PDF 1.5 feature)
I do not know how well Crystal Reports compresses PDFs, but you might want to try the Docotic.Pdf library with the following code and see whether your files can be compressed better.
public static void CompressExistingDocument(string original, string output)
{
    using (PdfDocument pdf = new PdfDocument(original))
    {
        pdf.SaveOptions.Compression = PdfCompression.Flate;
        pdf.SaveOptions.UseObjectStreams = true;
        pdf.SaveOptions.RemoveUnusedObjects = true;
        pdf.SaveOptions.WriteWithoutFormatting = true;

        pdf.Save(output);
    }

    FileInfo originalFileInfo = new FileInfo(original);
    FileInfo compressedFileInfo = new FileInfo(output);
    MessageBox.Show(
        String.Format("Original file size: {0} bytes;\r\nCompressed file size: {1} bytes",
        originalFileInfo.Length, compressedFileInfo.Length));

    System.Diagnostics.Process.Start(output);
}
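Since your PDFs live in a database, here is a rough sketch of applying the same save options to each stored document and writing the smaller result back. The table and column names (dbo.Documents, Pdf) are hypothetical and the BitMiracle.Docotic.Pdf namespace is assumed; adapt it to your schema.

using System.Data.SqlClient;
using System.IO;
using BitMiracle.Docotic.Pdf;

public static class StoredPdfCompressor
{
    // Hypothetical schema: dbo.Documents(Id INT, Pdf VARBINARY(MAX)).
    public static void CompressStoredPdfs(string connectionString)
    {
        using (var readConnection = new SqlConnection(connectionString))
        using (var writeConnection = new SqlConnection(connectionString))
        {
            readConnection.Open();
            writeConnection.Open();

            var select = new SqlCommand("SELECT Id, Pdf FROM dbo.Documents", readConnection);
            using (SqlDataReader rows = select.ExecuteReader())
            {
                while (rows.Read())
                {
                    int id = rows.GetInt32(0);
                    byte[] original = (byte[])rows[1];

                    // Round-trip through temp files so the compression code above
                    // can be reused unchanged (minus the MessageBox/Process.Start).
                    string inPath = Path.GetTempFileName();
                    string outPath = Path.GetTempFileName();
                    File.WriteAllBytes(inPath, original);

                    using (PdfDocument pdf = new PdfDocument(inPath))
                    {
                        pdf.SaveOptions.Compression = PdfCompression.Flate;
                        pdf.SaveOptions.UseObjectStreams = true;
                        pdf.SaveOptions.RemoveUnusedObjects = true;
                        pdf.SaveOptions.WriteWithoutFormatting = true;
                        pdf.Save(outPath);
                    }

                    byte[] compressed = File.ReadAllBytes(outPath);

                    // Only overwrite the stored copy if compression actually helped.
                    if (compressed.Length < original.Length)
                    {
                        var update = new SqlCommand(
                            "UPDATE dbo.Documents SET Pdf = @pdf WHERE Id = @id", writeConnection);
                        update.Parameters.AddWithValue("@pdf", compressed);
                        update.Parameters.AddWithValue("@id", id);
                        update.ExecuteNonQuery();
                    }

                    File.Delete(inPath);
                    File.Delete(outPath);
                }
            }
        }
    }
}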
Disclaimer: I work for the vendor of the library.
I have a large (6-page, 222-field) fillable PDF that I am using as a template with iTextSharp's PdfReader. Instantiating this object takes 5 minutes or more. I have tried:
string pdfPath = Path.Combine(context.Server.MapPath("~/apps/ssgenpdf/App_Data"), "07-2011 Worksheets.pdf");
reader = new PdfReader(pdfPath);
Alternatively, I have tried reading the file into a memory stream and passing the memory stream to the PdfReader constructor. Additionally, I have tried:
reader = new PdfReader(new RandomAccessFileOrArray(pdfPath), null);
None of these alternatives shows significant gains.
This is an ASP.NET app, so my interim solution is to do this creation on application start and cache the reader; then I check whether I get a valid reader from the cache and instantiate a new reader from that reader. With this approach I now routinely see responses under 50 milliseconds.
My concern is that this does not seem scalable if others in my group want to use this "fillable PDF as template with iTextSharp" strategy. Does anyone have any suggestions for alternative strategies to balance performance with scalability?
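For reference, a minimal sketch of the interim caching approach described above (class and field names are illustrative; it relies on iTextSharp's PdfReader(PdfReader) duplicate constructor, which is what "instantiate a new reader from that reader" refers to):

using System;
using iTextSharp.text.pdf;

public static class TemplateReaderCache
{
    // Hypothetical path; in the question it comes from Server.MapPath at app start.
    private static readonly string _templatePath =
        @"C:\apps\ssgenpdf\App_Data\07-2011 Worksheets.pdf";

    // Parse the template once per app domain; this is the slow, minutes-long step.
    private static readonly Lazy<PdfReader> _template =
        new Lazy<PdfReader>(() => new PdfReader(_templatePath), isThreadSafe: true);

    public static PdfReader GetReader()
    {
        // Duplicating an already-parsed reader is the cheap per-request step.
        return new PdfReader(_template.Value);
    }
}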
Make sure the PDFs you are using are the same on the server as they are locally; sometimes they get corrupted coming from source control. (I have faced this issue with VSS and fillable PDF forms I created via Nitro.)
It may also be better to ask the question on the relevant forum: http://forum.pdfsharp.net