Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 1 year ago.
I need to convert a PDF file into a JPEG using C#, and the solution (library) has to be free.
I have searched through a lot of information, but nothing I found was clear.
I already tried iTextSharp and PDFBox (though I think its pdf2image tool is Java-only), with no success.
I tried to extract the images from the PDF individually, but I get an invalid-parameter error when I try to extract them. It seems they have a strange encoding.
Can anyone recommend a library to save a PDF as a JPEG? Examples would be much appreciated too.
The PdfiumViewer library might be helpful here. It is also available as a NuGet package.
Create a new WinForms app and add the "PdfiumViewer" NuGet package to it.
This will also add two native DLLs named "pdfium.dll", in folders x86 and x64, to your project. Set "Copy to Output Directory" to "Copy Always" for both.
Try out the following code (change paths to suit your setup).
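try
{
    using (var document = PdfiumViewer.PdfDocument.Load(@"input.pdf"))
    {
        // The page index is 0-based; render the first page at 300 x 300 DPI.
        using (var image = document.Render(0, 300, 300, true))
        {
            // Use ImageFormat.Jpeg (System.Drawing.Imaging) and a .jpg name for JPEG output.
            image.Save(@"output.png", ImageFormat.Png);
        }
    }
}
catch (Exception ex)
{
    // handle exception here;
}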
Edit 2: Changed code to show that page index is 0 based as pointed out in comment by S.C. below
Edit 1: Updated solution
Have you tried pdfsharp?
This link might be helpful
This is how I did it with PDFLibNet:
public void ConvertPDFtoHojas(string filename, String dirOut)
{
    PDFLibNet.PDFWrapper _pdfDoc = new PDFLibNet.PDFWrapper();
    _pdfDoc.LoadPDF(filename);
    for (int i = 0; i < _pdfDoc.PageCount; i++)
    {
        using (Image img = RenderPage(_pdfDoc, i))
        {
            // Pass the format explicitly: Save(path) alone writes PNG data for an
            // in-memory bitmap, whatever the file extension says.
            img.Save(Path.Combine(dirOut, string.Format("{0}{1}.jpg", i, DateTime.Now.ToString("mmss"))), ImageFormat.Jpeg);
        }
    }
    _pdfDoc.Dispose();
}
public Image RenderPage(PDFLibNet.PDFWrapper doc, int page)
{
    doc.CurrentPage = page + 1;
    doc.CurrentX = 0;
    doc.CurrentY = 0;
    doc.RenderPage(IntPtr.Zero);
    // create an image to draw the page into
    var buffer = new Bitmap(doc.PageWidth, doc.PageHeight);
    doc.ClientBounds = new Rectangle(0, 0, doc.PageWidth, doc.PageHeight);
    using (var g = Graphics.FromImage(buffer))
    {
        var hdc = g.GetHdc();
        try
        {
            doc.DrawPageHDC(hdc);
        }
        finally
        {
            g.ReleaseHdc();
        }
    }
    return buffer;
}
Related
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 months ago.
I've got a request from a stakeholder who wants us to automate the following procedure.
Go to the CMS website (In-house application)
Take a picture of the report.
Send an email to stakeholders with the reports attached.
Note: This procedure must be repeated on a daily basis.
I'm not sure which project type to choose for this; at the moment, all I can think of is a Console Application, but I don't know much about it.
Any assistance would be much appreciated.
Code For Screenshot - Selenium C#
public class ScreenShotRepository
{
    public static void TakeScreenShot(IWebDriver Driver, string filename, List<string> text = null)
    {
        var bytesArr = Driver.TakeScreenshot(new VerticalCombineDecorator(new ScreenshotMaker()));
        var screenshotImage = (System.Drawing.Image)((new ImageConverter()).ConvertFrom(bytesArr));
        WriteToPDF(new List<System.Drawing.Image>() { screenshotImage }, filename, text);
    }
    public static void WriteToPDF(List<System.Drawing.Image> screenshots, string filename, List<string> text)
    {
        // Dispose the stream even if PDF generation throws.
        using (var fileStream = new FileStream(filename, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            // Size the PDF page to match the first screenshot, with no margins.
            var document = new Document(new iTextSharp.text.Rectangle(0, 0, screenshots[0].Width, screenshots[0].Height), 0, 0, 0, 0);
            var writer = PdfWriter.GetInstance(document, fileStream);
            document.Open();
            var content = writer.DirectContent;
            var font = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
            for (int i = 0; i < screenshots.Count; i++)
            {
                var image = iTextSharp.text.Image.GetInstance(screenshots[i], screenshots[i].RawFormat);
                document.Add(image);
                WriteText(content, font, text);
                if (i + 1 != screenshots.Count)
                    document.NewPage();
            }
            document.Close();
            writer.Close();
        }
    }
    public static void WriteText(PdfContentByte content, BaseFont font, List<string> text)
    {
        // TakeScreenShot defaults text to null, so there may be nothing to write.
        if (text == null)
            return;
        content.BeginText();
        content.SetColorFill(BaseColor.GREEN);
        content.SetFontAndSize(font, 40);
        for (int j = 0; j < text.Count; j++)
            content.ShowTextAligned(Element.ALIGN_LEFT, text[j].ToString(), 50, 50 + 50 * j, 0);
        content.EndText();
    }
}
You could make this a Windows Service because of the daily schedule requirement.
However, the simplest way is indeed a console application that you schedule with your operating system's task scheduler.
As for the requirements: why can't the reporting system output a PDF itself? Taking a screenshot of third-party software would already be a makeshift solution. Having to screenshot your own in-house reporting software, and to automate that outside the CMS team's domain, suggests that whoever maintains the in-house CMS is... not up to the task.
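For the email step of such a scheduled console application, here is a minimal sketch using the built-in System.Net.Mail types. The class name ReportMailer, the subject and body wording, and the idea of passing in an SMTP host name are assumptions made for illustration; nothing here comes from the original post, and your mail server may require credentials or TLS settings that are not shown.

```csharp
using System;
using System.Globalization;
using System.Net.Mail;

// Hypothetical helper for the "send the report by email" step.
public static class ReportMailer
{
    // Builds the message with the report file attached.
    // The caller should dispose the returned message, which also releases the attachment.
    public static MailMessage BuildReportMail(string from, string to, string reportPath, DateTime day)
    {
        string date = day.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
        var message = new MailMessage(from, to)
        {
            Subject = "Daily report " + date,
            Body = "The report for " + date + " is attached."
        };
        // Attachment(string) opens the file, so it must exist at this point.
        message.Attachments.Add(new Attachment(reportPath));
        return message;
    }

    // Sends the message through the given SMTP host (assumed to accept the connection as-is).
    public static void Send(MailMessage message, string smtpHost)
    {
        using (var client = new SmtpClient(smtpHost))
        {
            client.Send(message);
        }
    }
}
```

The Main method of the console application would take the screenshot, call BuildReportMail with the saved file, and then call Send; the task scheduler only has to start the executable once a day.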
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
I need to read a PDF and convert it to a .txt file. I tried iTextSharp as a free library; it worked fine but is not compatible with .NET Core.
Code snippet in iTextSharp
string prevPage = "";
for (int page = 5; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
var s = PdfTextExtractor.GetTextFromPage(reader, page, its);
if (prevPage != s) sb.Append(s);
prevPage = s;
}
reader.Close();
Also, I tried iTextSharp.LGPLv2.Core, but it does not work as well as the other one, and the results are not accurate.
One of the downsides of iTextSharp.LGPLv2.Core is that it does not handle text encoding, which results in noise in the text extracted from the PDF.
My stringbuilder looks like the image below:
Approach: PdfPig (Apache 2.0 License)
Install the NuGet package PdfPig
Tested on .NET Core 3.1
using (var stream = File.OpenRead(pdfPath1))
using (UglyToad.PdfPig.PdfDocument document = UglyToad.PdfPig.PdfDocument.Open(stream))
{
var page = document.GetPage(2);
return string.Join(" ", page.GetWords());
}
Approach: iTextSharp.LGPLv2.Core (GNU Lesser General Public License)
Install the NuGet package iTextSharp.LGPLv2.Core
It is an unofficial port of the last LGPL version of iTextSharp (V4.1.6) to .NET Core.
Tested on .NET Core 3.1
var reader = new PdfReader(pdfPath1);
var streamBytes = reader.GetPageContent(1);
var tokenizer = new PrTokeniser(new RandomAccessFileOrArray(streamBytes));
var sb = new StringBuilder();
while (tokenizer.NextToken())
{
    if (tokenizer.TokenType == PrTokeniser.TK_STRING)
    {
        var currentText = tokenizer.StringValue;
        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
        // Append the converted text rather than the raw token value.
        sb.Append(currentText);
    }
}
reader.Close();
Console.WriteLine("Extracted text " + sb);
Approach: GrapeCity.Documents.Pdf (commercially licensed)
Install the NuGet package GrapeCity.Documents.Pdf
It is a cross-platform library that allows creation, modification, and analysis of PDF documents.
Tested on .NET Core 3.1
var doc = new GcPdfDocument();
// Keep the stream open while the document is in use, and dispose it afterwards.
using (FileStream fs = new FileStream(pdfPath1, FileMode.Open, FileAccess.ReadWrite))
{
    doc.Load(fs);
    // Extract page 1 (the Pages collection is 0-based)
    var tmap_page1 = doc.Pages[0].GetTextMap();
    tmap_page1.GetFragment(out TextMapFragment newFragment, out string Extractedtext);
    Console.WriteLine("Extracted Text: \n\n" + Extractedtext);
}
I have a c# class that takes an HTML and converts it to PDF using wkhtmltopdf.
As you will see below, I am generating 3 PDFs - Landscape, Portrait, and combined of the two.
The properties object contains the html as a string, and the argument for landscape/portrait.
System.IO.MemoryStream PDF = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file = new System.IO.FileStream("abc_landscape.pdf", System.IO.FileMode.Create);
PDF.Position = 0;
properties.IsHorizontalOrientation = false;
System.IO.MemoryStream PDF_portrait = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file_portrait = new System.IO.FileStream("abc_portrait.pdf", System.IO.FileMode.Create);
PDF_portrait.Position = 0;
System.IO.MemoryStream finalStream = new System.IO.MemoryStream();
PDF.CopyTo(finalStream);
PDF_portrait.CopyTo(finalStream);
System.IO.FileStream file_combined = new System.IO.FileStream("abc_combined.pdf", System.IO.FileMode.Create);
try
{
PDF.WriteTo(file);
PDF.Flush();
PDF_portrait.WriteTo(file_portrait);
PDF_portrait.Flush();
finalStream.WriteTo(file_combined);
finalStream.Flush();
}
catch (Exception)
{
throw;
}
finally
{
PDF.Close();
file.Close();
PDF_portrait.Close();
file_portrait.Close();
finalStream.Close();
file_combined.Close();
}
The PDFs "abc_landscape.pdf" and "abc_portrait.pdf" generate correctly, as expected, but the operation fails when I try to combine the two in a third pdf (abc_combined.pdf).
I am using MemoryStream to perform the merge, and while debugging I can see that finalStream.Length is equal to the sum of the lengths of the previous two PDFs. But when I open the combined PDF, I see the content of just one of the two PDFs.
The same can be seen below:
Additionally, when I try to close the "abc_combined.pdf", I am prompted to save it, which does not happen with the other 2 PDFs.
Below are a few things that I have tried out already, to no avail:
Change CopyTo() to WriteTo()
Merge the same PDF (either Landscape or Portrait one) with itself
In case it is required, below is the elaboration of the GetPdfStream() method.
var htmlStream = new MemoryStream();
var writer = new StreamWriter(htmlStream);
writer.Write(htmlString);
writer.Flush();
htmlStream.Position = 0;
return htmlStream;
Process process = Process.Start(psi);
process.EnableRaisingEvents = true;
try
{
process.Start();
process.BeginErrorReadLine();
var inputTask = Task.Run(() =>
{
htmlStream.CopyTo(process.StandardInput.BaseStream);
process.StandardInput.Close();
});
// Copy the output to a memorystream
MemoryStream pdf = new MemoryStream();
var outputTask = Task.Run(() =>
{
process.StandardOutput.BaseStream.CopyTo(pdf);
});
Task.WaitAll(inputTask, outputTask);
process.WaitForExit();
// Reset memorystream read position
pdf.Position = 0;
return pdf;
}
catch (Exception ex)
{
throw ex;
}
finally
{
process.Dispose();
}
Merging PDFs in C#, or any other language, is not straightforward without a third-party library.
I assume you want to avoid a library because most free libraries and NuGet packages have limitations and/or cost money for commercial use.
I did some research and found an open source library called PdfClown, which has a NuGet package and is also available for Java. It is free without limitations (donate if you like). The library has a lot of features, one of which is merging two or more documents into one.
My example below takes a folder with multiple PDF files, merges them, and saves the result to the same or another folder. It is also possible to use MemoryStream, but I did not find it necessary in this case.
The code is self-explanatory; the key point is the use of SerializationModeEnum.Incremental:
public static void MergePdf(string srcPath, string destFile)
{
    // Validate the arguments before touching the file system.
    if (string.IsNullOrWhiteSpace(srcPath) || string.IsNullOrWhiteSpace(destFile))
        return;
    var list = Directory.GetFiles(Path.GetFullPath(srcPath));
    // Merging needs at least two input files.
    if (list.Length <= 1)
        return;
    var files = list.Select(File.ReadAllBytes).ToList();
    // The first file becomes the destination document; the rest are appended to it.
    using (var dest = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(files[0])))
    {
        var document = dest.Document;
        var builder = new org.pdfclown.tools.PageManager(document);
        foreach (var file in files.Skip(1))
        {
            using (var src = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(file)))
            {
                builder.Add(src.Document);
            }
        }
        dest.Save(destFile, SerializationModeEnum.Incremental);
    }
}
To test it:
var srcPath = @"C:\temp\pdf\input";
var destFile = @"c:\temp\pdf\output\merged.pdf";
MergePdf(srcPath, destFile);
Input examples
PDF doc A and PDF doc B
Output example
Links to my research:
https://csharp-source.net/open-source/pdf-libraries
https://sourceforge.net/projects/clown/
https://www.oipapio.com/question-3526089
Disclaimer: Part of this answer is taken from my personal web site https://itbackyard.com/merge-multiple-pdf-files-to-one-pdf-file-in-c/, with source code on GitHub.
This answer from Stack Overflow (Combine two (or more) PDF's) by Andrew Burns works for me:
using (PdfDocument one = PdfReader.Open("pdf 1.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument two = PdfReader.Open("pdf 2.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument outPdf = new PdfDocument())
{
CopyPages(one, outPdf);
CopyPages(two, outPdf);
outPdf.Save("file1and2.pdf");
}
void CopyPages(PdfDocument from, PdfDocument to)
{
for (int i = 0; i < from.PageCount; i++)
{
to.AddPage(from.Pages[i]);
}
}
That's not quite how PDFs work. PDFs are structured files in a specific format.
You can't just append the bytes of one to the other and expect the result to be a valid document.
You're going to have to use a library that understands the format and can do the operation for you, or develop your own solution.
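To see why the byte-level approach in the question cannot work, the following sketch reproduces the concatenation and counts the structural markers in the result. It uses only the base class library; the class and method names (PdfConcatCheck, CountMarker, Concatenate) are made up for this illustration.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper that shows what plain concatenation produces.
public static class PdfConcatCheck
{
    // Counts non-overlapping occurrences of an ASCII marker in a byte array.
    public static int CountMarker(byte[] data, string marker)
    {
        byte[] needle = Encoding.ASCII.GetBytes(marker);
        int count = 0;
        for (int i = 0; i + needle.Length <= data.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && data[i + j] == needle[j]) j++;
            if (j == needle.Length)
            {
                count++;
                i += needle.Length - 1; // continue after this match
            }
        }
        return count;
    }

    // Same idea as the question's CopyTo calls: the bytes of one file followed by the other.
    public static byte[] Concatenate(byte[] first, byte[] second)
    {
        using (var combined = new MemoryStream())
        {
            combined.Write(first, 0, first.Length);
            combined.Write(second, 0, second.Length);
            return combined.ToArray();
        }
    }
}
```

For two ordinary PDFs, counting "%PDF-" and "%%EOF" in the concatenated bytes would be expected to find each marker at least twice: the result is two complete files back to back, not one document with the pages of both. A viewer generally starts from the end of the file, which is consistent with only one document showing up and with the viewer offering to save a repaired copy.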
PDF files aren't just text and images. Behind the scenes there is a strict file format that describes things like the PDF version, the objects contained in the file, and where to find them.
In order to merge two PDFs you'll need to manipulate the streams.
First, keep the header from only one of the files. This is pretty easy since it's just the first line.
Then you can write the body of the first file, and then the body of the second.
Now the hard part, and likely the part that will convince you to use a library, is that you have to rebuild the xref table. The xref table is a cross-reference table that lists the objects in the document and, more importantly, where to find each one. You'd have to calculate the byte offset at which the second file's body now starts, shift every entry in its xref table by that much, and then add its entries to the first file's table. Each file numbers its objects independently, so the second file's object numbers will typically collide with the first's and have to be renumbered as well. You'll also need to make sure the xref table has entries for any objects you create to join the pages of the two files.
Once that's done, you need to rebuild the document trailer, which tells an application where the various sections of the document are, among other things.
See https://resources.infosecinstitute.com/pdf-file-format-basic-structure/
This is not trivial and you'll end up re-writing lots of code that already exists.
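The offset problem described above can be made concrete with a small sketch. It reads the number stored after the last startxref keyword and checks whether that position really holds a classic xref table. The names XrefOffsets, ReadStartXref, and PointsAtXrefTable are hypothetical, and the check assumes a plain-text xref table; files that use cross-reference streams instead (allowed since PDF 1.5) would not match the "xref" keyword.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper for inspecting the startxref pointer of a PDF.
public static class XrefOffsets
{
    // Returns the byte offset written after the last "startxref" keyword, or -1 if none is found.
    public static long ReadStartXref(byte[] pdf)
    {
        // Latin-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);
        int keyword = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0) return -1;

        int pos = keyword + "startxref".Length;
        while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;
        int start = pos;
        while (pos < text.Length && char.IsDigit(text[pos])) pos++;
        if (pos == start) return -1;
        return long.Parse(text.Substring(start, pos - start), CultureInfo.InvariantCulture);
    }

    // True if the bytes at the given offset spell the "xref" keyword of a classic xref table.
    public static bool PointsAtXrefTable(byte[] pdf, long offset)
    {
        byte[] keyword = Encoding.ASCII.GetBytes("xref");
        if (offset < 0 || offset + keyword.Length > pdf.Length) return false;
        for (int i = 0; i < keyword.Length; i++)
        {
            if (pdf[offset + i] != keyword[i]) return false;
        }
        return true;
    }
}
```

On a standalone file with a classic table, PointsAtXrefTable(pdf, ReadStartXref(pdf)) should be true. Once another file's bytes are placed in front, the stored number is unchanged but the table has moved by the length of the first file, so the same check is expected to fail unless the offset is shifted by that length. Every entry inside the table needs the same correction, which is the bookkeeping a PDF library does for you.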
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
Can I merge two or more PDFs in ASP.NET? I know I can do it with Word and Excel files using interop. But can I merge PDFs?
Any suggestions or links would be appreciated.
Try iTextSharp:
iTextSharp is a C# port of iText, an open source Java library for
PDF generation and manipulation. It can be used to create PDF
documents from scratch, to convert XML to PDF (using the extra XFA
Worker DLL), to fill out interactive PDF forms, to stamp new content
on existing PDF documents, to split and merge existing PDF documents,
and much more.
Here's an article on how to do it.
using iTextSharp.text;
using iTextSharp.text.pdf;

// Call this method from Main with the output path and the files to merge.
public static void MergePages(string outputPdfPath, string[] lstFiles)
{
    Document sourceDocument = new Document();
    PdfCopy pdfCopyProvider = new PdfCopy(sourceDocument,
        new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
    sourceDocument.Open();
    try
    {
        // Visit every input file, including the last one.
        for (int f = 0; f < lstFiles.Length; f++)
        {
            PdfReader reader = new PdfReader(lstFiles[f]);
            // Copy every page of the current file; page numbers start at 1.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                PdfImportedPage importedPage = pdfCopyProvider.GetImportedPage(reader, i);
                pdfCopyProvider.AddPage(importedPage);
            }
            reader.Close();
        }
    }
    finally
    {
        // Closing the document also closes the writer and its output stream.
        sourceDocument.Close();
    }
}
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
How do I extract an image from a pdf file, using c#? Thanks!
You could use iTextSharp. Here's an example.
Docotic.Pdf library can be used to extract images from PDFs.
Here is a sample that shows how to iterate through pages and extract all images from each PDF page:
static void ExtractImagesFromPdfPages()
{
string path = "";
using (PdfDocument pdf = new PdfDocument(path))
{
for (int i = 0; i < pdf.Pages.Count; i++)
{
for (int j = 0; j < pdf.Pages[i].Images.Count; j++)
{
string imageName = string.Format("page{0}-image{1}", i, j);
string imagePath = pdf.Pages[i].Images[j].Save(imageName);
}
}
}
}
The library won't resample images. It will save them exactly the same as in PDF.
Disclaimer: I work for Bit Miracle, vendor of the library.