Reduce pdf size in .Net

I have a PDF template of about 230 KB. In my Web API, for each user I take a copy of that template, fill in the data, and merge the results using the iTextSharp library. For 1,500 users, the merged file reaches about 320 MB.
I tried BitMiracle, which reduced the file to 160 MB, but that is still large.
In Acrobat Pro, Save as Other > Reduced Size PDF brought the file down to 25 MB.
I want to get the file down to about 25 MB from my Web API in C#, which will later be hosted on a server.
The user is not supposed to edit the PDF; they will just store it as a record. Can I generate a PostScript file and then use Acrobat Distiller to reduce the size? If yes, how can I do it?
I am using Ghostscript.NET. I wrote this method, which does not throw any error, but I cannot find the path of the generated PostScript file:
public void convertToPs(string file)
{
    try
    {
        // Ask the shell to print the file to the named printer via the "printto" verb
        Process printProcess = new Process();
        printProcess.StartInfo.FileName = file;
        printProcess.StartInfo.Verb = "printto";
        printProcess.StartInfo.Arguments = "\"Ghostscript PDF\"";
        printProcess.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
        printProcess.StartInfo.CreateNoWindow = true;
        printProcess.Start();
        // Wait until the PostScript file is created
        try
        {
            printProcess.WaitForExit();
        }
        catch (InvalidOperationException) { }
        printProcess.Dispose();
    }
    catch (Exception)
    {
        // Rethrow without resetting the stack trace ("throw ex;" would reset it)
        throw;
    }
}
Please help

Does the template have embedded fonts? The merge probably does not combine those fonts, so each copy carries its own. If you don't need embedded fonts, you could remove them. Adobe does a good job of combining embedded fonts.
If you like, you can send me one of these large PDF documents so that I can see why the file gets so big. I am a developer of a PDF library, and producing the smallest possible PDF is a use case I am working on.
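Since the question already mentions Ghostscript, here is a minimal sketch of another route: letting Ghostscript's pdfwrite device rewrite the merged PDF directly, without a PostScript step. The class and method names are hypothetical, the path to the Ghostscript console executable (for example gswin64c.exe) is an assumption about your server, and the "/ebook" preset is only one possible choice. Presets downsample images, so check the output quality, and do not expect the result to match Acrobat's 25 MB exactly.

```csharp
using System;
using System.Diagnostics;

public static class GhostscriptShrinker
{
    // Builds the command line for Ghostscript's pdfwrite device.
    // preset is a -dPDFSETTINGS value such as "/screen", "/ebook" or "/printer".
    public static string BuildArguments(string inputPdf, string outputPdf, string preset = "/ebook")
    {
        return string.Join(" ",
            "-sDEVICE=pdfwrite",
            "-dPDFSETTINGS=" + preset,
            "-dNOPAUSE", "-dBATCH", "-dQUIET",
            "-sOutputFile=\"" + outputPdf + "\"",
            "\"" + inputPdf + "\"");
    }

    // Runs Ghostscript and returns its exit code (0 means success).
    // ghostscriptExe is the full path to the console executable (an assumption).
    public static int Shrink(string ghostscriptExe, string inputPdf, string outputPdf)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = ghostscriptExe,
            Arguments = BuildArguments(inputPdf, outputPdf),
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

Because the output path is passed explicitly with -sOutputFile, there is no question of where the generated file ends up, unlike with the "printto" approach.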

Related

Printing PDF programmatically bloats file size

OK, so I've been given the task of building a lightweight printing feature to replace a third-party tool that costs a significant amount of money and has far too many features.
I've managed to build a little system that polls some data and calls an endpoint on an on-premise MVC app, which in turn prints the document.
All is great, but I'm really struggling to figure out why the PDF file size bloats when hitting the Print Queue.
Currently the file is 822 KB. When I print manually via Adobe, the PDF is compressed to 342 KB, but when I print through my system it bloats to an astonishing 4.22 MB.
Note that I am using the PDFium SDK NuGet package to take away some of the heavy lifting. That said, I do use System.Drawing.Printing to build the settings I pass to PDFium.
A little code to demonstrate printing:
public bool PrintPDF(string printer,
    string filePath)
{
    try
    {
        var printerSettings = new PrinterSettings
        {
            PrinterName = "Hewlett-Packard HP LaserJet P2015 Series",
            Copies = 1,
        };
        // Assumed: default page settings for the chosen printer
        // (the declaration of pageSettings was missing from the snippet)
        var pageSettings = new PageSettings(printerSettings);
        using (var document = PdfDocument.Load(@"C:\folder\Documentation\test.pdf"))
        {
            using (var printDocument = document.CreatePrintDocument())
            {
                printDocument.PrinterSettings = printerSettings;
                printDocument.DefaultPageSettings = pageSettings;
                printDocument.DocumentName = "test.pdf";
                // StandardPrintController prints without showing a status dialog
                printDocument.PrintController = new StandardPrintController();
                printDocument.Print();
            }
        }
        return true;
    }
    catch (System.Exception ex)
    {
        new Email().SendEmail("", "TEST ERR", ex.Message, "email address");
        return false;
    }
}
At the moment I'd be happy if it printed at the file's actual size (822 KB) rather than bloating it.
I'd really appreciate some guidance and a nudge in the right direction.

PDF is (usually) a vector representation of the page; it's a page description. PDF can contain bitmap data as well, but for text and line art it's usually vector, and white space simply isn't included in the description at all.
When you print, then behind the scenes the application creates a device context compatible with the printer you select, replays the drawing commands it used to draw the content on the display, and then tells the printer context to print.
That causes the device driver to be handed the GDI commands to draw the page. Depending on the printer type (i.e. which page description language it understands), the device driver can simply pass on the commands (for a GDI printer), convert them to a high-level vector representation (like PostScript), or render them to a bitmap. Some drivers may combine these approaches. The result is then sent to the printer.
The Adobe PDF 'printer' works by co-opting the Windows PostScript printer driver, which converts GDI commands into vector PostScript operations, which are easily turned into vector PDF operations, resulting in a small representation of the page.
It sounds to me like your printer (or possibly printer driver) is 'dumb' and wants, or is being sent, a big bitmap. Once upon a time, in the days when printers ran on serial interfaces and 9600 baud was fast, it was worth keeping the file size small and having the printer be smart, because it took a long time to send the data. Nowadays, that's less of a concern, even several megabytes can transfer rapidly, and if you send a pre-rendered bitmap to the printer, the printer can be dumb and still print fast, because all it has to do is transfer the bits.
You haven't really said what you mean when you "print manually using Adobe" or "use the system" so I can't tell you more than that, but my guess would be that your big PDF simply contains a large (compressed) image.

c# printing through PDF drivers, print to file option will output PS instead of PDF

After struggling for a whole day, I identified the issue, but that did not solve my problem.
In short:
I need to open a PDF, convert it to black and white (grayscale), search for some words, and insert notes near the words found. At first glance it seems easy, but I discovered how hard PDF files are to process (they have no concept of "words", and so on).
Now the first task, converting to grayscale, just drove me crazy. I did not find a working solution, either commercial or free. I came up with this approach:
open the PDF
print it with a Windows driver, using one of the free PDF printers
This is quite ugly, since it forces users of the C# application to install third-party software, but it will do for the moment. I tested FreePDF, CutePDF and PDFCreator. All of them work standalone as expected.
Now, when printing from C#, I obviously don't want the print dialog; I just want to select the BW option and print (that is, convert).
The following code just uses a PDF library, shown for clarity only.
Aspose.Pdf.Facades.PdfViewer viewer = new Aspose.Pdf.Facades.PdfViewer();
viewer.BindPdf(txtPDF.Text);
viewer.PrintAsGrayscale = true;
//viewer.RenderingOptions = new RenderingOptions { UseNewImagingEngine = true };
//Set attributes for printing
//viewer.AutoResize = true; //Print the file with adjusted size
//viewer.AutoRotate = true; //Print the file with adjusted rotation
viewer.PrintPageDialog = false; //Do not produce the page number dialog when printing
////PrinterJob printJob = PrinterJob.getPrinterJob();
//Create objects for printer and page settings and PrintDocument
System.Drawing.Printing.PrinterSettings ps = new System.Drawing.Printing.PrinterSettings();
System.Drawing.Printing.PageSettings pgs = new System.Drawing.Printing.PageSettings();
//System.Drawing.Printing.PrintDocument prtdoc = new System.Drawing.Printing.PrintDocument();
//prtdoc.PrinterSettings = ps;
//Set printer name
//ps.PrinterName = prtdoc.PrinterSettings.PrinterName;
ps.PrinterName = "CutePDF Writer";
ps.PrintToFile = true;
ps.PrintFileName = @"test.pdf";
//
//ps.
//Set PageSize (if required)
//pgs.PaperSize = new System.Drawing.Printing.PaperSize("A4", 827, 1169);
//Set PageMargins (if required)
//pgs.Margins = new System.Drawing.Printing.Margins(0, 0, 0, 0);
//Print document using printer and page settings
viewer.PrintDocumentWithSettings(ps);
//viewer.PrintDocument();
//Close the PDF file after printing
What I discovered, and what seems to be poorly documented, is that if you set
ps.PrintToFile = true;
then regardless of the C# PDF library or the PDF printer driver, Windows skips the PDF driver and outputs PostScript (PS) files instead of PDF files, which Adobe Reader obviously does not recognize.
Now the question (and I am sure others who want to print PDFs from C# have run into it) is: how do I print to CutePDF, for example, and still suppress any filename dialog?
In other words, how do I print silently, with a filename selected programmatically from the C# application? Or how can I convince "print to file" to go through the PDF driver rather than the default Windows PS driver?
Thanks very much for any hints.

I solved the grayscale conversion with a commercial component, as described in the post below, where I also posted my complete solution in case anyone else struggles as I did.
Converting PDF to Grayscale pdf using ABC PDF

SQL Server, C# and iTextSharp. What's the best way to join PDFs?

I have a SQL Server database with many, many rows. Each row has a column that contains a stored PDF.
The database is a gigabyte in size, so we can expect roughly half of that to be the PDFs.
Now I have a requirement to join all those PDFs into one PDF. Don't ask why.
Can you suggest the best way forward and which component is best suited for this job? There are many answers available on how to join two (or more) PDFs:
How can I join two PDF's using iTextSharp?
Merge memorystreams to one itext document
How to merge multiple pdf files (generated in run time)?
What I'm asking about, though, is performance. We are literally dealing with around 50,000 PDFs that need to be merged into one almighty PDF.
[Edit: solution] This brought the time to merge 1,000 PDFs down from 4m30s to 21s:
public void MergePDFs(string targetPDF, string sourceDir)
{
    using (FileStream stream = new FileStream(targetPDF, FileMode.Create))
    {
        var files = Directory.GetFiles(sourceDir);
        Document pdfDoc = new Document(PageSize.A4);
        PdfCopy pdf = new PdfCopy(pdfDoc, stream);
        pdfDoc.Open();
        Console.WriteLine("Merging files count: " + files.Length);
        int i = 1;
        var watch = System.Diagnostics.Stopwatch.StartNew();
        foreach (string file in files)
        {
            Console.WriteLine(i + ". Adding: " + file);
            pdf.AddDocument(new PdfReader(file));
            i++;
        }
        if (pdfDoc != null)
            pdfDoc.Close();
        watch.Stop();
        var elapsedMs = watch.ElapsedMilliseconds;
        MessageBox.Show(elapsedMs.ToString());
    }
}

I just did a C#/WinForms project with PDFsharp that merged images into PDFs, and it worked phenomenally with a traditional folder structure. I imagine it would work similarly with PDFs stored in a database, so long as you can pull them into a memory stream first and then merge them.
Some suggestions:
1) Recommend doing it in a multi-threaded environment so you can work on multiple PDFs at a time.
2) Open only what you need and close as soon as the operation is complete. So say you have three documents that need to be merged into one. Create a blank PDF. Open first into a memory stream, open blank. Append first to blank. Close first, save blank, close blank. Repeat for second and third. This way you control how much memory you are taking up at any one point in time. In this way I was able to append millions of images, but control memory usage.
3) Ensure you wrap disposable objects in using statements. This helps with memory cleanup and removes the need to call the garbage collector explicitly, which is frowned upon.
4) Separate your business (work) from your UI as best you can so you can cancel the operation at any point in time, or view current status as it progresses through.
5) Log everything that is done so that you can go back and correct one-offs for the PDFs that didn't make it through the first pass.

Token not expected processing PDF using PDFsharp

I have two very similar PDF files. If I try to process the first one, it throws a "Token '373071' was not expected" exception, but for the other one the code runs to completion. Below is my code:
class Program
{
    static void Main(string[] args)
    {
        int bufferSize = 20480;
        try
        {
            byte[] byteBuffer = new byte[bufferSize];
            byteBuffer = File.ReadAllBytes(@"..\..\Fail.pdf");
            MemoryStream coverSheetContent = new MemoryStream();
            coverSheetContent.Write(byteBuffer, 0, byteBuffer.Length);
            int t = PdfReader.TestPdfFile(coverSheetContent);
            PdfReader.Open(coverSheetContent);
        }
        catch (Exception ex)
        {
        }
    }
}
I've also attached those PDF files. They are raw input for me; I do not know where they were created or by whom.
Fail.pdf
Success.pdf
There is very little information about PDFsharp, so please help me solve the problem.

The SAP tool that was used to create the PDF files adds many filling bytes after the "%%EOF" marker. PDFsharp up to version 1.32 expects the %%EOF marker within the trailing 130 bytes of the file.
You can modify the method ReadTrailer() in class Parser to search a larger area.
An implementation that searches the complete file can be found here:
http://forum.pdfsharp.net/viewtopic.php?p=583#p583
BTW: You can open the PDF like this:
var doc = PdfReader.Open(@"..\..\fail.pdf");
No need to allocate a buffer that will never be used, no stream needed.
Update: Since 2014 PDFsharp searches the complete PDF file if the "%%EOF" marker cannot be found near the end of the file. So if you are using PDFsharp 1.50 or newer it is no longer necessary to download and modify the code. Those who still use PDFsharp 1.32 or even older versions still have to modify the source.
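To check whether a given file is affected, you can measure how many bytes follow the last "%%EOF" marker. The following is a minimal sketch using only the .NET base class library. The class and method names are hypothetical, and it does not reproduce PDFsharp's actual parsing logic; it only reports the amount of padding.

```csharp
using System;
using System.Text;

public static class PdfTrailerProbe
{
    // Returns the offset of the last "%%EOF" marker, or -1 if there is none.
    public static int FindLastEofMarker(byte[] pdfBytes)
    {
        byte[] marker = Encoding.ASCII.GetBytes("%%EOF");
        // Scan backwards so the marker closest to the end of the file wins.
        for (int start = pdfBytes.Length - marker.Length; start >= 0; start--)
        {
            int matched = 0;
            while (matched < marker.Length && pdfBytes[start + matched] == marker[matched])
            {
                matched++;
            }
            if (matched == marker.Length)
            {
                return start;
            }
        }
        return -1;
    }

    // Returns the number of bytes after the last "%%EOF" marker, or -1 if there is none.
    public static int BytesAfterEofMarker(byte[] pdfBytes)
    {
        int offset = FindLastEofMarker(pdfBytes);
        return offset < 0 ? -1 : pdfBytes.Length - (offset + 5); // 5 = length of "%%EOF"
    }
}
```

For example, PdfTrailerProbe.BytesAfterEofMarker(File.ReadAllBytes(path)) returning a value well above 130 would be consistent with the failure described above for PDFsharp 1.32 and older.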

.net compressing pdf generated with crystal reports

As of now, we generate PDFs programmatically using Crystal Reports and save them to a database. Each PDF document contains a barcode image and is 120-150 KB in size.
Everything runs fine, but lately we have been facing a problem with rapid growth in database size and storage requirements, because 100 to 1,000 records are generated each day.
Is there any way to compress the PDF files before storing them? Are there any APIs or tools that do this without causing problems for the barcode? Can we expect much of a size reduction from compression?
Or is there a better alternative way to store the data?
Any suggestions on this would be highly appreciated.
Thanks,
Sveerap

Unfortunately, you won't gain much by compressing a PDF as it is already compressed.

Many compressed PDF files can be compressed further.
Size of a PDF file can usually be decreased by:
removing unused objects (if any)
removing extra whitespace characters from the file (not from the visual content)
using object streams (a PDF 1.5 feature)
I do not know how well Crystal Reports compresses PDFs, but you might want to try the Docotic.Pdf library with the following code and see whether your files can be compressed further.
public static void CompressExistingDocument(string original, string output)
{
    using (PdfDocument pdf = new PdfDocument(original))
    {
        pdf.SaveOptions.Compression = PdfCompression.Flate;
        pdf.SaveOptions.UseObjectStreams = true;
        pdf.SaveOptions.RemoveUnusedObjects = true;
        pdf.SaveOptions.WriteWithoutFormatting = true;
        pdf.Save(output);
    }
    FileInfo originalFileInfo = new FileInfo(original);
    FileInfo compressedFileInfo = new FileInfo(output);
    MessageBox.Show(
        String.Format("Original file size: {0} bytes;\r\nCompressed file size: {1} bytes",
            originalFileInfo.Length, compressedFileInfo.Length));
    System.Diagnostics.Process.Start(output);
}
Disclaimer: I work for the vendor of the library.
