I am using iText 7 to check a PDF that a user has downloaded from the internet, and I am running into a problem.
For my test case I created a text file with garbage in it and saved it as a PDF. I know it's not valid.
In the code I try to open the PDF using PdfReader.
An exception is thrown, which is expected.
When debugging, the reader object is null when execution reaches the finally block, so reader.Close() never fires.
I am even copying the file to a temp directory to make sure nothing else is holding the file.
After the exception I am unable to delete the PDF file, either in code or manually in a file explorer.
My code follows the exceptions below. I removed everything but the reader part. This is the code after I have tried a few things, so you are seeing my attempt with the file copied to a temp file. I attempt to delete the temp file in the finally block, and that fails on a corrupt file.
Here are both exceptions that are thrown when attempting to validate a bad PDF. The first is from the PdfReader call.
2021-04-09 13:18:11,079 ERROR GUI.Form1 - PDF header not found.
iText.IO.IOException: PDF header not found.
at iText.IO.Source.PdfTokenizer.GetHeaderOffset()
at iText.Kernel.Pdf.PdfReader.GetOffsetTokeniser(IRandomAccessSource byteSource)
at iText.Kernel.Pdf.PdfReader..ctor(String filename, ReaderProperties properties)
at iText.Kernel.Pdf.PdfReader..ctor(FileInfo file)
at GUI.Form1.validatePDF(FileInfo pdfFile, HashSet`1 tmpMd5s)
The second is from the attempt to delete the temp file.
2021-04-09 13:18:11,116 ERROR GUI.Form1 - The process cannot access the file 'C:\Users\ret63\AppData\Local\Temp\tmp27DE.tmp' because it is being used by another process.
System.IO.IOException: The process cannot access the file 'C:\Users\ret63\AppData\Local\Temp\tmp27DE.tmp' because it is being used by another process.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.IO.FileInfo.Delete()
at GUI.Form1.validatePDF(FileInfo pdfFile, HashSet`1 tmpMd5s)
PdfDocument pdfDoc = null;
PdfReader reader = null;
try
{
using (reader = new PdfReader(testFile))
{
//pdfDoc = new PdfDocument(reader);
//pdfDoc = new PdfDocument(new PdfReader(pdfFile.FullName));
//Console.WriteLine("Number of Pages: " + pdfDoc.GetNumberOfPages());
//pdfDoc.Close();
}
}
catch(Exception ex)
{
log.Error(ex.Message, ex);
throw new Exception("Invalid PDF File: " + pdfFile.Name);
}
finally
{
if (reader != null)
{
reader.Close();
}
if (pdfDoc != null && !pdfDoc.IsClosed())
{
pdfDoc.Close();
}
try
{
if (testFile.Exists)
{
testFile.Delete();
}
}
catch (Exception ee)
{
Console.WriteLine(ee.Message);
}
}
This looks like an iText bug. If you trace what the PdfReader constructor calls, you see that it creates a FileStream that is conditionally locked. The FileStream is wrapped in a RandomAccessSource, which is then wrapped in a PdfTokenizer inside GetOffsetTokeniser. If GetHeaderOffset throws (line 1433), the local variable tok is never closed, so the underlying file handle stays open.
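One way to sidestep the leaked handle, as a rough sketch, is to look for the "%PDF-" marker yourself before iText ever touches the file. The FileStream lives in a using block, so the handle is released whether or not the check passes. The names PdfHeaderCheck and HasPdfHeader and the 1024-byte scan window are assumptions made for illustration; they are not part of iText, and a file that passes this check can still be corrupt further in.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfHeaderCheck
{
    // Hypothetical helper: returns true when "%PDF-" appears in the first 1024 bytes.
    // The stream is opened and disposed here, so no handle outlives the call.
    public static bool HasPdfHeader(string path)
    {
        byte[] marker = Encoding.ASCII.GetBytes("%PDF-");
        byte[] buffer = new byte[1024];
        int total = 0;

        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            int read;
            // Read may return fewer bytes than requested, so loop until full or end of file.
            while (total < buffer.Length &&
                   (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }
        }

        // Naive scan for the marker inside the bytes that were actually read.
        for (int i = 0; i + marker.Length <= total; i++)
        {
            int j = 0;
            while (j < marker.Length && buffer[i + j] == marker[j]) j++;
            if (j == marker.Length) return true;
        }
        return false;
    }
}
```

Calling PdfHeaderCheck.HasPdfHeader(testFile.FullName) before new PdfReader(testFile) would let the garbage file be rejected and deleted without the PdfReader constructor being reached.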
I'm attempting to automate a mail merge process using C#, a DataSet, and OpenXML. I have a complete working example when running locally. After publishing to our web server, however, I get an Access Denied error, even after going so far as to grant Full Control everywhere.
Here is the code leading up to the error message:
try
{
var strTemplateTestFile = strMergeBuildingLocation.Replace(".docx", "_Test.docx");
// Don't continue if the template file name is not found
if (!File.Exists(strTemplateFileName))
throw new Exception("TemplateFileName (" + strTemplateFileName + ") does not exist");
foreach (var dr in dsData.Tables[0].Rows)
{
string strFileName;
if (doesDestinationExist(strMergeBuildingLocation))
{
File.Copy(strTemplateFileName, strTemplateTestFile, true);
strFileName = strTemplateTestFile;
}
else
{
File.Copy(strTemplateFileName, strMergeBuildingLocation, true);
strFileName = strMergeBuildingLocation;
}
var pkg = Package.Open(strFileName, FileMode.Open, FileAccess.ReadWrite);
using (var docGenerated = WordprocessingDocument.Open(pkg))
The problem falls within the last line upon attempting to open docGenerated.
The error message I'm receiving is:
Access to the path 'docx path' is denied.
The file copies as expected and is able to be opened and modified manually. There's nothing within the folders that would be restricting access to the file. Does anyone have any thoughts as to what the issue could be?
I think the problem is in these lines.
var pkg = Package.Open(strFileName, FileMode.Open, FileAccess.ReadWrite);
using (var docGenerated = WordprocessingDocument.Open(pkg))
Here you are trying to open the document twice.
Try this:
/* Open WordProcessing document package based on filename */
//------------------------------------------------------------------------------------------------Start
public static WordprocessingDocument OpenPackage(WordprocessingDocument package, string inputFileName, bool editable)
{
bool copied = false;
while (!copied)
{
try
{
package = WordprocessingDocument.Open(inputFileName, editable);
copied = true;
}
catch (Exception e)
{
if (e is FileFormatException)
{
package = null;
break;
}
if (e is IOException)
{
copied = false;
}
if (e is ZipException)
{
package = null;
break;
}
}
}
return package;
}
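This returns the WordprocessingDocument once the file can be opened. If the file is not a valid package (FileFormatException or ZipException), null is returned. If the file is locked, the loop keeps retrying and opens the package once the file is released. Note that a missing file raises FileNotFoundException, which derives from IOException, so as written the loop keeps retrying in that case as well.
Hope this helps. Thank you!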
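Because the loop above has no upper bound and no pause between attempts, a bounded variant may be safer. The following is only a sketch: FileRetry, OpenWithRetry, and the attempt count and delay parameters are hypothetical names chosen for illustration. It retries on IOException, gives up immediately on a missing file, and lets any other exception (such as FileFormatException) propagate to the caller.

```csharp
using System;
using System.IO;
using System.Threading;

public static class FileRetry
{
    // Hypothetical helper: calls 'open' up to maxAttempts times, sleeping 'delay'
    // between attempts while the file is locked. Returns null if it never succeeds.
    public static T OpenWithRetry<T>(Func<T> open, int maxAttempts, TimeSpan delay)
        where T : class
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return open();
            }
            catch (FileNotFoundException)
            {
                // Waiting will not make a missing file appear.
                return null;
            }
            catch (IOException)
            {
                // Typically a sharing violation: wait and try again.
                if (attempt < maxAttempts) Thread.Sleep(delay);
            }
        }
        return null;
    }
}
```

With the Open XML SDK this could be called as FileRetry.OpenWithRetry(() => WordprocessingDocument.Open(inputFileName, editable), 5, TimeSpan.FromMilliseconds(200)), where the five attempts and 200 ms delay are arbitrary example values.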
I'm using iTextSharp to read the contents of PDF documents:
PdfReader reader = new PdfReader(pdfPath);
using (StringWriter output = new StringWriter())
{
for (int i = 1; i <= reader.NumberOfPages; i++)
output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
reader.Close();
pdfText = output.ToString();
}
99% of the time it works just fine. However, there is this one PDF file that will sometimes throw this exception:
PDF header signature not found.
StackTrace:
at iTextSharp.text.pdf.PRTokeniser.CheckPdfHeader()
at iTextSharp.text.pdf.PdfReader.ReadPdf()
at iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword)
at Reader.PDF.DownloadPdf(String url) in
What's annoying is that I can't always reproduce the error. Sometimes it works, sometimes it doesn't. Has anyone encountered this problem?
After some research, I've found that this problem relates to either a file being corrupted during PDF generation, or an object in the document that doesn't conform to the PDF standard as implemented in iTextSharp. It also seems to happen only when you read a PDF file from disk.
I have not found a complete solution to the problem, only a workaround: open the PDF document with the iTextSharp PdfReader object first and see whether an exception is thrown, before reading the file in normal operation.
So run something similar to this:
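private bool IsValidPdf(string filepath)
{
bool Ret = true;
PdfReader reader = null;
try
{
// The constructor parses the header and throws if the file is not a readable PDF
reader = new PdfReader(filepath);
}
catch
{
Ret = false;
}
finally
{
// Release the file so the caller can open or delete it afterwards
if (reader != null)
{
reader.Close();
}
}
return Ret;
}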
I found it was because I was calling new PdfReader(pdf) with the PDF stream position at the end of the file. By setting the position to zero it resolved the issue.
Before:
// Throws: InvalidPdfException: PDF header signature not found.
var pdfReader = new PdfReader(pdf);
After:
// Works correctly.
pdf.Position = 0;
var pdfReader = new PdfReader(pdf);
In my case, it was because I was passing in a .json file, and iTextSharp obviously only accepts PDF files.
It is also possible that the file is open in another method or program, as was my case. Verify that nothing else is working with your file. You can use the Resource Monitor to see which processes are holding it.
If a file is open by another application, and then I try to save it through the Silverlight SaveDialog, I can catch the error with an exception but after that I get this error.
Line: 57
Error: Unhandled Error in Silverlight Application
Code: 4004
Category: ManagedRuntimeError
Message: System.InvalidOperationException: This operation can only occur on the UI Thread.
at System.Windows.Hosting.NativeHost.VerifyThread()
at System.Windows.SaveFileStream.Dispose(Boolean disposing)
at System.IO.FileStream.Finalize()
I would prefer to detect that the file is open, but can't seem to do that. I tried fs.CanWrite, but it returns true, even when the file is open by another application.
EDIT: Here is a post on the silverlight forum that seems to explain what is happening, although they think it's just Office files. I'm having the problem with a PDF file.
Here is my code:
public void PDFSaveFile(bool success)
{
// silverlight requires saveFileDialog to be user-initiated,
// so this is called from the OK button of a pop-up window
// ignore success, we only gave an OK option
byte[] fileBytes = doc.ToPDF().ToArray();
PDFClose();
try
{
SaveFileDialog saveFileDlg = new SaveFileDialog();
saveFileDlg.Filter = "PDF files (*.pdf)|*.pdf";
bool? dialogResult = saveFileDlg.ShowDialog();
if (dialogResult == true)
{
using (var fs = saveFileDlg.OpenFile())
{
fs.Write(fileBytes, 0, fileBytes.Length);
fs.Close();
}
}
}
catch (Exception ex)
{
Log.HandleInternalError(string.Format("Unable to save file: {0}",ex.Message));
}
}
You can use FileInfo.Open. If it throws an IOException, the file is already open elsewhere.
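A minimal sketch of that probe follows, assuming the full .NET Framework file APIs are available (a sandboxed Silverlight application restricts direct file access, so this may not carry over unchanged). FileLockProbe and IsLocked are hypothetical names. The result is only a snapshot: another process can open the file right after the probe returns, and other exceptions such as UnauthorizedAccessException are left to propagate.

```csharp
using System.IO;

public static class FileLockProbe
{
    // Hypothetical helper: tries to open the file exclusively.
    // An IOException other than "not found" is treated as "in use elsewhere".
    public static bool IsLocked(string path)
    {
        try
        {
            using (new FileInfo(path).Open(FileMode.Open, FileAccess.ReadWrite, FileShare.None))
            {
                return false; // exclusive open succeeded, so nobody else holds it
            }
        }
        catch (FileNotFoundException)
        {
            return false; // a missing file is not locked
        }
        catch (DirectoryNotFoundException)
        {
            return false; // same for a missing directory
        }
        catch (IOException)
        {
            return true; // most likely a sharing violation
        }
    }
}
```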
The following code gives me a System.IO.IOException with the message 'The process cannot access the file'.
private void UnPackLegacyStats()
{
DirectoryInfo oDirectory;
XmlDocument oStatsXml;
//Get the directory
oDirectory = new DirectoryInfo(msLegacyStatZipsPath);
//Check if the directory exists
if (oDirectory.Exists)
{
//Loop files
foreach (FileInfo oFile in oDirectory.GetFiles())
{
//Check if file is a zip file
if (C1ZipFile.IsZipFile(oFile.FullName))
{
//Open the zip file
using (C1ZipFile oZipFile = new C1ZipFile(oFile.FullName, false))
{
//Check if the zip contains the stats
if (oZipFile.Entries.Contains("Stats.xml"))
{
//Get the stats as a stream
using (Stream oStatsStream = oZipFile.Entries["Stats.xml"].OpenReader())
{
//Load the stats as xml
oStatsXml = new XmlDocument();
oStatsXml.Load(oStatsStream);
//Close the stream
oStatsStream.Close();
}
//Loop hit elements
foreach (XmlElement oHitElement in oStatsXml.SelectNodes("/*/hits"))
{
//Do stuff
}
}
//Close the file
oZipFile.Close();
}
}
//Delete the file
oFile.Delete();
}
}
}
I am struggling to see where the file could still be locked. All objects that could be holding onto a handle to the file are in using blocks and are explicitly closed.
Is it something to do with using FileInfo objects rather than the strings returned by the static GetFiles method?
Any ideas?
I do not see any problems in your code; everything looks OK. To check whether the problem lies in C1ZipFile, I suggest you initialize the zip from a stream instead of from a file path, so that you close the stream explicitly:
//Open the zip file
using (Stream ZipStream = oFile.OpenRead())
using (C1ZipFile oZipFile = new C1ZipFile(ZipStream, false))
{
// ...
Several other suggestions:
You do not need to call Close() when you are already inside using (...); remove those calls.
Move the XML processing (the loop over hit elements) outside the zip processing, i.e. after the zip file is closed, so you keep the file open for as short a time as possible.
I assume you're getting the error on the oFile.Delete call. I was able to reproduce this error. Interestingly, the error only occurs when the file is not a zip file. Is this the behavior you are seeing?
It appears that the C1ZipFile.IsZipFile call is not releasing the file when it's not a zip file. I was able to avoid this problem by using a FileStream instead of passing the file path as a string (the IsZipFile function accepts either).
So the following modification to your code seems to work:
if (oDirectory.Exists)
{
//Loop files
foreach (FileInfo oFile in oDirectory.GetFiles())
{
using (FileStream oStream = new FileStream(oFile.FullName, FileMode.Open))
{
//Check if file is a zip file
if (C1ZipFile.IsZipFile(oStream))
{
// ...
}
}
//Delete the file
oFile.Delete();
}
}
In response to the original question in the subject: I don't know if it's possible to know if a file can be deleted without attempting to delete it. You could always write a function that attempts to delete the file and catches the error if it can't and then returns a boolean indicating whether the delete was successful.
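Such a function could look roughly like the sketch below. SafeDelete and TryDelete are hypothetical names, and treating IOException and UnauthorizedAccessException as "could not delete" is an assumption about which failures the caller wants swallowed.

```csharp
using System;
using System.IO;

public static class SafeDelete
{
    // Hypothetical helper: attempts the delete and reports whether the file is gone.
    public static bool TryDelete(string path)
    {
        try
        {
            File.Delete(path); // does nothing if the file does not exist
            return !File.Exists(path);
        }
        catch (IOException)
        {
            return false; // e.g. the file is in use by another process
        }
        catch (UnauthorizedAccessException)
        {
            return false; // e.g. read-only file or missing permission
        }
    }
}
```

In the loop above, oFile.Delete() could then be replaced by a check on SafeDelete.TryDelete(oFile.FullName), logging the files that could not be removed instead of aborting the whole run.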
I'm just guessing: are you sure that oZipFile.Close() is enough? Perhaps you have to call oZipFile.Dispose() or oZipFile.Finalize() to be sure it has actually released the resources.
More than likely it's not being disposed. Any time you access something outside of managed code (streams, files, etc.) you MUST dispose of it. I learned this the hard way with ASP.NET and image files: it will fill up your memory, crash your server, etc.
In the interest of completeness I am posting my working code, as the changes came from more than one source.
private void UnPackLegacyStats()
{
DirectoryInfo oDirectory;
XmlDocument oStatsXml;
//Get the directory
oDirectory = new DirectoryInfo(msLegacyStatZipsPath);
//Check if the directory exists
if (oDirectory.Exists)
{
//Loop files
foreach (FileInfo oFile in oDirectory.GetFiles())
{
//Set empty xml
oStatsXml = null;
//Load file into a stream
using (Stream oFileStream = oFile.OpenRead())
{
//Check if file is a zip file
if (C1ZipFile.IsZipFile(oFileStream))
{
//Open the zip file
using (C1ZipFile oZipFile = new C1ZipFile(oFileStream, false))
{
//Check if the zip contains the stats
if (oZipFile.Entries.Contains("Stats.xml"))
{
//Get the stats as a stream
using (Stream oStatsStream = oZipFile.Entries["Stats.xml"].OpenReader())
{
//Load the stats as xml
oStatsXml = new XmlDocument();
oStatsXml.Load(oStatsStream);
}
}
}
}
}
//Check if we have stats
if (oStatsXml != null)
{
//Process XML here
}
//Delete the file
oFile.Delete();
}
}
}
The main lesson I learned from this is to manage file access in one place in the calling code rather than letting other components manage their own file access. This is most appropriate when you want to use the file again after the other component has finished its task.
Although this takes a little more code you can clearly see where the stream is disposed (at the end of the using), compared to having to trust that a component has correctly disposed of the stream.
I'm using PDFBox for a C# .NET project, and I'm getting a "TypeInitializationException" (The type initializer for 'java.lang.Throwable' threw an exception.) when executing the following block of code:
FileStream stream = new FileStream(@"C:\1.pdf",FileMode.Open);
//retrieve the pdf bytes from the stream.
byte[] pdfbytes=new byte[65000];
stream.Read(pdfbytes, 0, 65000);
//get the pdf file bytes.
allbytes = pdfbytes;
//create a stream from the file bytes.
java.io.InputStream ins = new java.io.ByteArrayInputStream(allbytes);
string txt;
//load the doc
PDDocument doc = PDDocument.load(ins);
PDFTextStripper stripper = new PDFTextStripper();
//retrieve the pdf doc's text
txt = stripper.getText(doc);
doc.close();
the exception occurs at the 3rd statement :
PDDocument doc = PDDocument.load(ins);
What can I do to solve this ?
This is the stack trace :
at java.lang.Throwable.__<map>(Exception , Boolean )
at org.pdfbox.pdfparser.PDFParser.parse()
at org.pdfbox.pdmodel.PDDocument.load(InputStream input, RandomAccess scratchFile)
at org.pdfbox.pdmodel.PDDocument.load(InputStream input)
at At.At.ExtractTextFromPDF(InputStream fileStream) in
C:\Users\Administrator\Documents\Visual Studio 2008\Projects\AtProject\Att\At.cs:line 61
Inner Exception of the InnerException :
InnerException {"Could not load file or assembly 'IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The system cannot find the file specified.":"IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58"} System.Exception {System.IO.FileNotFoundException}
OK, I solved the previous problem by copying some of the PDFBox .dll files to my bin folder, but now I'm getting this error: expected='/' actual='.'--1 org.pdfbox.io.PushBackInputStream@283d742
Are there any alternatives to PDFBox? Is there any other reliable library I can use to extract text from PDF files?
It looks like you are missing some libraries for PDFBox. You need:
IKVM.GNU.Classpath.dll
PDFBox-X.X.X.dll
FontBox-X.X.X-dev.dll
IKVM.Runtime.dll
See the topic "Read from a PDF file using C#". A similar problem is discussed in its comments.
I found the versions of the DLL files were the culprits.
Go to http://www.netlikon.de/docs/PDFBox-0.7.2/bin/?C=M;O=A and download the following files:
IKVM.Runtime.dll
IKVM.GNU.Classpath.dll
PDFBox-0.7.2.dll
Then copy them into the root of your Visual Studio project. Right click the project and add reference, find all 3 and add them.
Finally here's the code I used to parse the PDF into Text
C#
private static string TransformPdfToText(string SourceFile)
{
string content = "";
PDFTextStripper stripper = new PDFTextStripper();
PDDocument doc = PDDocument.load(SourceFile);
try
{
content = stripper.getText(doc);
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
finally
{
// Close the document exactly once, whether or not extraction succeeded
doc.close();
}
return content;
}
Visual Basic
Private Function parseUsingPDFBox(ByVal filename As String) As String
LogFile(" Attempting to parse file: " & filename)
Dim stripper As PDFTextStripper = New PDFTextStripper()
Dim doc As PDDocument = PDDocument.load(filename)
Dim content As String = "empty"
Try
content = stripper.getText(doc)
Catch ex As Exception
LogFile(" Error parsing file: " & filename & vbCrLf & ex.Message)
Finally
' Close the document exactly once, whether or not extraction succeeded
doc.close()
End Try
Return content
End Function
I had a similar problem, though not with C++ but with Visual Basic / Visual Studio. The missing DLL was "commons-logging.dll"; after adding it to the bin directory everything worked fine.