iTextSharp exception: PDF header signature not found - c#

I'm using iTextSharp to read the contents of PDF documents:
PdfReader reader = new PdfReader(pdfPath);
using (StringWriter output = new StringWriter())
{
for (int i = 1; i <= reader.NumberOfPages; i++)
output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
reader.Close();
pdfText = output.ToString();
}
99% of the time it works just fine. However, there is this one PDF file that will sometimes throw this exception:
PDF header signature not found. StackTrace: at
iTextSharp.text.pdf.PRTokeniser.CheckPdfHeader() at
iTextSharp.text.pdf.PdfReader.ReadPdf() at
iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[]> ownerPassword) at
Reader.PDF.DownloadPdf(String url) in
What's annoying is that I can't always reproduce the error. Sometimes it works, sometimes it doesn't. Has anyone encountered this problem?

After some research, I've found that this problem relates to either a file being corrupted during PDF generation, or an error related to an object in the document that doesn't conform to the PDF standard as implemented in iTextSharp. It also seems to happen only when you read from a PDF file from disk.
I have not found a complete solution to the problem, but only a workaround. What I've done is read the PDF document using the PdfReader itextsharp object and see if an error or exception happens before reading the file in a normal operation.
So running something similar to this:
private bool IsValidPdf(string filepath)
{
bool Ret = true;
PdfReader reader = null;
try
{
reader = new PdfReader(filepath);
}
catch
{
Ret = false;
}
return Ret;
}

I found it was because I was calling new PdfReader(pdf) with the PDF stream position at the end of the file. By setting the position to zero it resolved the issue.
Before:
// Throws: InvalidPdfException: PDF header signature not found.
var pdfReader = new PdfReader(pdf);
After:
// Works correctly.
pdf.Position = 0;
var pdfReader = new PdfReader(pdf);

In my case, it was because I was calling a .json file, and iTextSharp only accepts pdf file obviously.

There is the possibility that you are opening the file with another method or program as was my case. Verify that nothing is working with your file, you can also use the resource monitor to verify which processes are working on your file.

Related

OpenXml Dotx to Docx Word Found Unreadable Content

I am trying to write code to fill a template's content controls then save it as a new file.
I found this very helpful entry Word OpenXml Word Found Unreadable Content And used the code there.
I copied the code from that post as shown below
public static MemoryStream ReadAllBytesToMemoryStream(string path)
{
byte[] buffer = File.ReadAllBytes(path);
var destStream = new MemoryStream(buffer.Length);
destStream.Write(buffer, 0, buffer.Length);
destStream.Seek(0, SeekOrigin.Begin);
return destStream;
}
public static void Generate()
{
using MemoryStream stream = ReadAllBytesToMemoryStream(#"c:\Templates\TemplateTest.dotx");
using (WordprocessingDocument wpd = WordprocessingDocument.Open(stream, true))
{
wpd.ChangeDocumentType(WordprocessingDocumentType.Document);
}
File.WriteAllBytes(#"c:\Templates\TemplateTestOutput.docx", stream.GetBuffer());
return;
}
It successfully creates the file, but the problem is, whenever I open the new docx file, it gives the "Word Found Unreadable Content" error. The template I made isn't complex, it just has 3 content controls with regular text for labels. I also tried copying a regular docx with just some lines of text, same error.
Whenever I click ok, on the Word Found Unreadable Content error, it shows the document just fine. I'm not sure what I'm doing wrong, I'm not even editing anything at this point.
Figured out a solution.
Instead of using File.WriteAllBytes
File.WriteAllBytes(#"C:\\Templates\TemplateTestOutput.docx", stream.GetBuffer());
I used the following code:
using (FileStream fileStream = new FileStream(#"C:\\Templates\TemplateTestOutput.docx", System.IO.FileMode.CreateNew))
{
stream.WriteTo(fileStream);
}

Reading PDF in net core with itext7 returns "\n\n\n\n\n...."

i have a netcore 3 app to read and split a PDF containing paychecks of some companies which i am working for.
This app ran pretty well since last builds... my the way, the PDF reader started to fail to parse the contents of any PDF.
PDF is built only with Italian words, no special chars. Few tables and a single logo. I'm not able to attach it due to privacy.
public PaycheckSplitter Read()
{
using (var reader = new PdfReader(new MemoryStream(this._stream)))
{
var doc = new PdfDocument(reader);
this.Paycheck = new PaychecksCollection();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
PdfPage page = doc.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
if (text.Contains(Consts.BpEnd)) break;
// trying to find something by regex... btw text contains only a sequence of \n\n\n\n...
string cf = Consts.CodFiscale.Match(text).Value;
this.Paychecks.Add(new Paycheck(cf), i);
}
doc.Close();
}
return this;
}
Anything i can do?
As far as i can see... the only and best way to have something to read a PDF text for free is iText7...

Unable to merge 2 PDFs using MemoryStream

I have a c# class that takes an HTML and converts it to PDF using wkhtmltopdf.
As you will see below, I am generating 3 PDFs - Landscape, Portrait, and combined of the two.
The properties object contains the html as a string, and the argument for landscape/portrait.
System.IO.MemoryStream PDF = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file = new System.IO.FileStream("abc_landscape.pdf", System.IO.FileMode.Create);
PDF.Position = 0;
properties.IsHorizontalOrientation = false;
System.IO.MemoryStream PDF_portrait = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file_portrait = new System.IO.FileStream("abc_portrait.pdf", System.IO.FileMode.Create);
PDF_portrait.Position = 0;
System.IO.MemoryStream finalStream = new System.IO.MemoryStream();
PDF.CopyTo(finalStream);
PDF_portrait.CopyTo(finalStream);
System.IO.FileStream file_combined = new System.IO.FileStream("abc_combined.pdf", System.IO.FileMode.Create);
try
{
PDF.WriteTo(file);
PDF.Flush();
PDF_portrait.WriteTo(file_portrait);
PDF_portrait.Flush();
finalStream.WriteTo(file_combined);
finalStream.Flush();
}
catch (Exception)
{
throw;
}
finally
{
PDF.Close();
file.Close();
PDF_portrait.Close();
file_portrait.Close();
finalStream.Close();
file_combined.Close();
}
The PDFs "abc_landscape.pdf" and "abc_portrait.pdf" generate correctly, as expected, but the operation fails when I try to combine the two in a third pdf (abc_combined.pdf).
I am using MemoryStream to preform the merge, and at the time of debug, I can see that the finalStream.length is equal to the sum of the previous two PDFs. But when I try to open the PDF, I see the content of just 1 of the two PDFs.
The same can be seen below:
Additionally, when I try to close the "abc_combined.pdf", I am prompted to save it, which does not happen with the other 2 PDFs.
Below are a few things that I have tried out already, to no avail:
Change CopyTo() to WriteTo()
Merge the same PDF (either Landscape or Portrait one) with itself
In case it is required, below is the elaboration of the GetPdfStream() method.
var htmlStream = new MemoryStream();
var writer = new StreamWriter(htmlStream);
writer.Write(htmlString);
writer.Flush();
htmlStream.Position = 0;
return htmlStream;
Process process = Process.Start(psi);
process.EnableRaisingEvents = true;
try
{
process.Start();
process.BeginErrorReadLine();
var inputTask = Task.Run(() =>
{
htmlStream.CopyTo(process.StandardInput.BaseStream);
process.StandardInput.Close();
});
// Copy the output to a memorystream
MemoryStream pdf = new MemoryStream();
var outputTask = Task.Run(() =>
{
process.StandardOutput.BaseStream.CopyTo(pdf);
});
Task.WaitAll(inputTask, outputTask);
process.WaitForExit();
// Reset memorystream read position
pdf.Position = 0;
return pdf;
}
catch (Exception ex)
{
throw ex;
}
finally
{
process.Dispose();
}
Merging pdf in C# or any other language is not straight forward with out using 3rd party library.
I assume your requirement for not using library is that most Free libraries, nuget packages has limitation or/and cost money for commercial use.
I have made research and found you an Open Source library called PdfClown with nuget package, it is also available for Java. It is Free with out limitation (donate if you like). The library has a lot of features. One such you can merge 2 or more documents to one document.
I supply my example that take a folder with multiple pdf files, merged it and save it to same or another folder. It is also possible to use MemoryStream, but I do not find it necessary in this case.
The code is self explaining, the key point here is using SerializationModeEnum.Incremental:
public static void MergePdf(string srcPath, string destFile)
{
var list = Directory.GetFiles(Path.GetFullPath(srcPath));
if (string.IsNullOrWhiteSpace(srcPath) || string.IsNullOrWhiteSpace(destFile) || list.Length <= 1)
return;
var files = list.Select(File.ReadAllBytes).ToList();
using (var dest = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(files[0])))
{
var document = dest.Document;
var builder = new org.pdfclown.tools.PageManager(document);
foreach (var file in files.Skip(1))
{
using (var src = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(file)))
{ builder.Add(src.Document); }
}
dest.Save(destFile, SerializationModeEnum.Incremental);
}
}
To test it
var srcPath = #"C:\temp\pdf\input";
var destFile = #"c:\temp\pdf\output\merged.pdf";
MergePdf(srcPath, destFile);
Input examples
PDF doc A and PDF doc B
Output example
Links to my research:
https://csharp-source.net/open-source/pdf-libraries
https://sourceforge.net/projects/clown/
https://www.oipapio.com/question-3526089
Disclaimer: A part of this answer is taken from my my personal web site https://itbackyard.com/merge-multiple-pdf-files-to-one-pdf-file-in-c/ with source code to github.
This answer from Stack Overflow (Combine two (or more) PDF's) by Andrew Burns works for me:
using (PdfDocument one = PdfReader.Open("pdf 1.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument two = PdfReader.Open("pdf 2.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument outPdf = new PdfDocument())
{
CopyPages(one, outPdf);
CopyPages(two, outPdf);
outPdf.Save("file1and2.pdf");
}
void CopyPages(PdfDocument from, PdfDocument to)
{
for (int i = 0; i < from.PageCount; i++)
{
to.AddPage(from.Pages[i]);
}
}
That's not quite how PDFs work. PDFs are structured files in a specific format.
You can't just append the bytes of one to the other and expect the result to be a valid document.
You're going to have to use a library that understands the format and can do the operation for you, or developing your own solution.
PDF files aren't just text and images. Behind the scenes there is a strict file format that describes things like PDF version, the objects contained in the file and where to find them.
In order to merge 2 PDFs you'll need to manipulate the streams.
First you'll need to conserve the header from only one of the files. This is pretty easy since it's just the first line.
Then you can write the body of the first page, and then the second.
Now the hard part, and likely the part that will convince you to use a library, is that you have to re-build the xref table. The xref table is a cross reference table that describes the content of the document and more importantly where to find each element. You'd have to calculate the byte offset of the second page, shift all of the elements in it's xref table by that much, and then add it's xref table to the first. You'll also need to ensure you create objects in the xref table for the page break.
Once that's done, you need to re-build the document trailer which tells an application where the various sections of the document are among other things.
See https://resources.infosecinstitute.com/pdf-file-format-basic-structure/
This is not trivial and you'll end up re-writing lots of code that already exists.

iText C#: Remove creation date

Is it possible to modify/remove the creation date in metadata? I'm looking to do something similar to this:
Overwrite creationDate in pdf using iText and pdf writer
EDIT:
I have tried the following methods:
writer.Info.Remove(PdfName.CREATIONDATE);
or
writer.Info.Put(PdfName.CREATIONDATE, new PdfDate(new DateTime(2017, 01, 01)));
where writer is a PdfWriter object.
However, that creates a copy of the object (a PdfDictionary) and doesn't modify the PDF I'm creating.
I also can't assign i.e. writer.Info = info
I tried following the advice given in the Java article.
I tried to do this:
var info = writer.Info;
stamper.MoreInfo = info
where stamper is a PdfStamper
But the types are incompatible and I don't think this would work. Does anyone know the actual methods to remove/modify the metadata?
EDIT 2:
Here is the code, I'm creating a new file from an existing PDF.
var filename = #"C:\Users\Someone\Documents\aPdf.pdf";
using( var output = new MemoryStream() )
{
Document document = new Document();
PdfCopy writer = new PdfCopy( document, output );
writer.CloseStream = false;
document.Open();
//read in PDF
PdfReader reader = new PdfReader(filename);
reader.ConsolidateNamedDestinations();
PdfImportedPage page = writer.GetImportedPage(reader, 1);
writer.AddPage(page);
reader.Close();
writer.Close();
document.Close();
return output.ToArray();
}
Now, when I open the file with a text editor this line is inserted (I need it constant/gone):
<</Producer(iTextSharp’ 5.5.12 ©2000-2017 iText Group NV \(AGPL-version\))/CreationDate(D:20180412155130+01'00')/ModDate(D:20180412155130+01'00')>>
The reason why we need to remove/set the date is that we're taking the MD5 hash of the file. Every time a new document is generated, that line changes leading to different MD5 hashes.
As I was trying to get a constant MD5 checksum for the generated file, I had to also set the ID constant, as mentioned by mkl.
My solution was to search byte array produced (i.e. the created PDF), and manually set the values to constants. The text is ASCII chars. I removed the /CreationDate and /ModifiedDated from the PDF entirely, and set the generated ID to a constant arbitrary value.

how to know for corrupted PDF file before merging using iTextSharp in C#

I am using iTextSharp to merge pdf pages.
But they might be some corrupted pdf.
My question is, how to verify programmatically whether the pdf is corrupted or not?
I usually check the header of a file to see what kind of file it is. A PDF header always starts with %PDF.
Ofcourse the file could be corrupted AFTER the header, then I am not really sure if there is any other way than just trying to open and read from the document. When the file is corrupted, opening OR reading from that document probably gives an exception. I am not sure iTextSharp throws all kinds of exceptions, but I think you can test that out.
One way, since you're merging files, is to wrap your code in a try...catch block:
Dictionary<string, Exception> errors =
new Dictionary<string, Exception>();
document.Open();
PdfContentByte cb = writer.DirectContent;
foreach (string filePath in testList) {
try {
PdfReader reader = new PdfReader(filePath);
int pages = reader.NumberOfPages;
for (int i = 0; i < pages; ) {
document.NewPage();
PdfImportedPage page = writer.GetImportedPage(reader, ++i);
cb.AddTemplate(page, 0, 0);
}
}
// **may** be PDF spec, but not supported by iText
catch (iTextSharp.text.exceptions.UnsupportedPdfException ue) {
errors.Add(filePath, ue);
}
// invalid according to PDF spec
catch (iTextSharp.text.exceptions.InvalidPdfException ie) {
errors.Add(filePath, ie);
}
catch (Exception e) {
errors.Add(filePath, e);
}
}
if (errors.Keys.Count > 0) {
document.NewPage();
foreach (string key in errors.Keys) {
document.Add(new Paragraph(string.Format(
"FILE: {0}\nEXCEPTION: [{1}]: {2}",
key, errors[key].GetType(), errors[key].Message
)));
}
}
where testList is a collection of file paths to the PDF documents you're merging.
On a separate note, you also need to consider what you define as corrupt. There are many PDF documents out there that do not meet PDF specs, but some readers (Adobe Reader) are smart enough to fix/repair them on the fly.

Categories

Resources