Extracting text from a PDF file - c#

I'm using PDFBox in a C# .NET project, and I'm getting a "TypeInitializationException" (The type initializer for 'java.lang.Throwable' threw an exception.) when executing the following block of code:
FileStream stream = new FileStream(@"C:\1.pdf", FileMode.Open);
//retrieve the pdf bytes from the stream.
byte[] pdfbytes=new byte[65000];
stream.Read(pdfbytes, 0, 65000);
//get the pdf file bytes.
allbytes = pdfbytes;
//create a stream from the file bytes.
java.io.InputStream ins = new java.io.ByteArrayInputStream(allbytes);
string txt;
//load the doc
PDDocument doc = PDDocument.load(ins);
PDFTextStripper stripper = new PDFTextStripper();
//retrieve the pdf doc's text
txt = stripper.getText(doc);
doc.close();
The exception occurs at this statement:
PDDocument doc = PDDocument.load(ins);
What can I do to solve this?
This is the stack trace:
at java.lang.Throwable.__<map>(Exception , Boolean )
at org.pdfbox.pdfparser.PDFParser.parse()
at org.pdfbox.pdmodel.PDDocument.load(InputStream input, RandomAccess scratchFile)
at org.pdfbox.pdmodel.PDDocument.load(InputStream input)
at At.At.ExtractTextFromPDF(InputStream fileStream) in C:\Users\Administrator\Documents\Visual Studio 2008\Projects\AtProject\Att\At.cs:line 61
Inner exception of the InnerException:
InnerException {"Could not load file or assembly 'IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58' or one of its dependencies. The system cannot find the file specified.":"IKVM.Runtime, Version=0.30.0.0, Culture=neutral, PublicKeyToken=13235d27fcbfff58"} System.Exception {System.IO.FileNotFoundException}
OK, I solved the previous problem by copying some of the PDFBox .dll files to my bin folder, but now I'm getting this error: expected='/' actual='.'--1 org.pdfbox.io.PushBackInputStream@283d742
Are there any alternatives to PDFBox? Is there another reliable library I can use to extract text from PDF files?

It looks like you are missing some libraries that PDFBox depends on. You need:
IKVM.GNU.Classpath.dll
PDFBox-X.X.X.dll
FontBox-X.X.X-dev.dll
IKVM.Runtime.dll
Read the topic "Read from a PDF file using C#". A similar problem is discussed in the comments there.
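Since the FileNotFoundException in the question comes from an assembly that is not next to the executable, a quick check of the output folder can confirm which files are absent. This is only a minimal sketch: `DependencyCheck` and `FindMissing` are hypothetical names, not part of PDFBox or IKVM, and the file names you pass in must match the versions you actually ship.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class DependencyCheck
{
    // Hypothetical helper: returns the names from 'required'
    // that do not exist as files in 'directory'.
    public static List<string> FindMissing(string directory, IEnumerable<string> required)
    {
        var missing = new List<string>();
        foreach (string name in required)
        {
            if (!File.Exists(Path.Combine(directory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

It could be called at startup with `AppDomain.CurrentDomain.BaseDirectory` as the directory and the DLL names listed above as the required files, logging whatever comes back.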

I found that the versions of the DLL files were the culprits.
Go to http://www.netlikon.de/docs/PDFBox-0.7.2/bin/?C=M;O=A and download the following files:
IKVM.Runtime.dll
IKVM.GNU.Classpath.dll
PDFBox-0.7.2.dll
Then copy them into the root of your Visual Studio project. Right-click the project, choose Add Reference, browse to all three files, and add them.
Finally, here's the code I used to parse the PDF into text:
C#
private static string TransformPdfToText(string SourceFile)
{
    string content = "";
    // Load the document first; there is no need to create and close an empty one.
    PDDocument doc = PDDocument.load(SourceFile);
    try
    {
        PDFTextStripper stripper = new PDFTextStripper();
        content = stripper.getText(doc);
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
    finally
    {
        // Close exactly once, whether or not extraction succeeded.
        doc.close();
    }
    return content;
}
Visual Basic
Private Function parseUsingPDFBox(ByVal filename As String) As String
    LogFile(" Attempting to parse file: " & filename)
    ' Load the document first; there is no need to create and close an empty one.
    Dim doc As PDDocument = PDDocument.load(filename)
    Dim content As String = "empty"
    Try
        Dim stripper As PDFTextStripper = New PDFTextStripper()
        content = stripper.getText(doc)
    Catch ex As Exception
        LogFile(" Error parsing file: " & filename & vbCrLf & ex.Message)
    Finally
        ' Close exactly once, whether or not extraction succeeded.
        doc.close()
    End Try
    Return content
End Function

I had a similar problem, not with C# but with Visual Basic in Visual Studio. The missing DLL was "commons-logging.dll"; after adding this DLL to the bin directory, everything worked fine.

Related

Export information to an xlsm file, C# VS asp.net

Is there a way to export information to an xlsm file? The steps I follow are:
A button opens an input to select the file, and I upload the file to the server.
I look for the sheet, which is already specified in the code.
I modify the file according to the information to be exported.
I send the file to be saved locally.
The error is as follows:
{"The 'br' start tag on line 59 position 30 does not match the end tag of 'font'. Line 60, position 9."}
It occurs when selecting the sheet to work with.
Here is my code. Any suggestions?
public void ExportFile(string FileName, string UserID)
{
    FileInfo fi = new FileInfo(FileName);
    Master.MSGError = string.Empty;
    string SheetName = "test";
    using (MemoryStream file = new MemoryStream())
    {
        try
        {
            using (ExcelPackage xlPackage = new ExcelPackage(fi))
            {
                ExcelWorksheet worksheet;
                worksheet = xlPackage.Workbook.Worksheets[SheetName]; //here is the error exception
                worksheet.Cells[1, 1].Value = "TEST";
                //save file
                xlPackage.SaveAs(file);
                Response.ContentType = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                Response.BinaryWrite(file.ToArray());
                Response.Flush();
                Response.End();
            }
        }
        catch (Exception ex)
        {
            Master.fc.MSGError = ex.Message;
        }
    }
}

I have now solved my problem. I thought the issue was in the macro, but after running different tests I found the real error: it seems that both EPPlus and ClosedXML have problems reading certain content in the Excel file. I ended up using ClosedXML and applying the solution from:
OpenXml Excel: throw error in any word after mail address
Sorry for the confusion.

IText 7 in C# locking bad pdf

I am running into a problem where I am using iText 7 to check a PDF that a user has downloaded from the internet.
For my test case I created a text file with garbage in it and saved it as a PDF. I know it's not valid.
In the code I am trying to open the PDF using PdfReader.
An exception is thrown; this is expected.
When debugging the code, the reader object is null when execution reaches the finally block, so reader.Close() isn't even firing.
I am even copying the file to a temp directory just to ensure nothing else is holding the file.
After the exception, I am unable to delete the PDF file, either in code or manually in a file explorer.
Here is some of my code. I removed everything but the reader part. This code is from after I had tried a few things, so you are seeing my attempt with the file copied to a temp file. I attempt to delete the temp file in the finally block. That fails on a corrupt file.
Here are both exceptions that are thrown when attempting to validate a bad PDF. The first is from the PdfReader call:
2021-04-09 13:18:11,079 ERROR GUI.Form1 - PDF header not found.
iText.IO.IOException: PDF header not found.
at iText.IO.Source.PdfTokenizer.GetHeaderOffset()
at iText.Kernel.Pdf.PdfReader.GetOffsetTokeniser(IRandomAccessSource byteSource)
at iText.Kernel.Pdf.PdfReader..ctor(String filename, ReaderProperties properties)
at iText.Kernel.Pdf.PdfReader..ctor(FileInfo file)
at GUI.Form1.validatePDF(FileInfo pdfFile, HashSet`1 tmpMd5s)
The second is from the attempt to delete the temp file:
2021-04-09 13:18:11,116 ERROR GUI.Form1 - The process cannot access the file 'C:\Users\ret63\AppData\Local\Temp\tmp27DE.tmp' because it is being used by another process.
System.IO.IOException: The process cannot access the file 'C:\Users\ret63\AppData\Local\Temp\tmp27DE.tmp' because it is being used by another process.
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.IO.FileInfo.Delete()
at GUI.Form1.validatePDF(FileInfo pdfFile, HashSet`1 tmpMd5s)
PdfDocument pdfDoc = null;
PdfReader reader = null;
try
{
    using (reader = new PdfReader(testFile))
    {
        //pdfDoc = new PdfDocument(reader);
        //pdfDoc = new PdfDocument(new PdfReader(pdfFile.FullName));
        //Console.WriteLine("Number of Pages: " + pdfDoc.GetNumberOfPages());
        //pdfDoc.Close();
    }
}
catch (Exception ex)
{
    log.Error(ex.Message, ex);
    throw new Exception("Invalid PDF File: " + pdfFile.Name);
}
finally
{
    if (reader != null)
    {
        reader.Close();
    }
    if (pdfDoc != null && !pdfDoc.IsClosed())
    {
        pdfDoc.Close();
    }
    try
    {
        if (testFile.Exists)
        {
            testFile.Delete();
        }
    }
    catch (Exception ee)
    {
        Console.WriteLine(ee.Message);
    }
}

Looks like an iText bug. If you trace out what gets called by the PdfReader constructor, you see that it creates a FileStream that is conditionally locked. The FileStream gets wrapped in a RandomAccessSource which is then wrapped in a PdfTokenizer in GetOffsetTokeniser. If GetHeaderOffset throws on line 1433, that tok local is never closed.
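Until the library closes that stream itself, one possible workaround is to reject obviously invalid files before they ever reach PdfReader, using a stream you control and dispose yourself. The sketch below uses only the .NET base class library. `PdfPrecheck` and `HasPdfHeader` are hypothetical names, and scanning the first 1024 bytes for "%PDF-" is an assumption about what counts as a plausible header, not a statement of iText's exact rule.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfPrecheck
{
    // Hypothetical helper (not part of iText): returns true when the
    // "%PDF-" marker appears within the first 1024 bytes of the file.
    public static bool HasPdfHeader(string path)
    {
        byte[] head = new byte[1024];
        int total = 0;
        // The using block releases the file handle even if a read throws.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            int n;
            while (total < head.Length &&
                   (n = fs.Read(head, total, head.Length - total)) > 0)
            {
                total += n;
            }
        }
        // ASCII decoding turns non-ASCII bytes into '?', which cannot
        // produce a false match for the marker.
        string text = Encoding.ASCII.GetString(head, 0, total);
        return text.IndexOf("%PDF-", StringComparison.Ordinal) >= 0;
    }
}
```

If `HasPdfHeader` returns false, the file can be reported as invalid and deleted without constructing a PdfReader at all. A file that passes this check can still be corrupt further in, so the existing try/catch around the reader remains necessary.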

c# docx load xml from string xml

I have created a program that reads a file as an array of bytes. The program consumes Word files using the DocX library from Xceed. What I want to do is recreate the parsed .docx file from the array of bytes.
To bytes:
var doc = DocX.Load("afile.docx");
...
return Encoding.Unicode.GetBytes(doc.Xml.Document.ToString());
Parse:
var doc = DocX.Create("anotherFile.docx");
var document = Encoding.Unicode.GetString({--returned bytes--}); // document is a string with XML
How can I save the document so that it matches the original?
I'm only getting a blank file without any content.

using (var doc = DocX.Load("afile.docx"))
{
    //here modify
    doc.SaveAs("anotherFile.docx");
}

See the documentation for BinaryWriter:
bWriter.Write(bytearray);
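A .docx file is a ZIP package made of several parts, so the XML of the main document alone is generally not enough to rebuild the original file. If the goal is simply to carry the file around as bytes and restore it unchanged, a minimal sketch with the base class library looks like this. `FileBytes`, `ToBytes` and `FromBytes` are hypothetical names chosen for the example.

```csharp
using System.IO;

public static class FileBytes
{
    // Reads the whole file, package structure included, as raw bytes.
    public static byte[] ToBytes(string path)
    {
        return File.ReadAllBytes(path);
    }

    // Writes the bytes back unchanged, recreating the original file.
    public static void FromBytes(byte[] data, string path)
    {
        File.WriteAllBytes(path, data);
    }
}
```

The restored file can then be opened with DocX.Load like any other document.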

System.XML and Encoding trouble

I have an application that creates XML documents based on existing ones. Today I noticed that an error occurs if the opened file's encoding is ANSI. Before that I worked with UTF-8 files and the problem did not arise. What should I do?
Code fragments:
string filepath;
XmlDocument xdoc = new XmlDocument();
XmlElement root;
...............
if (openFileDialog1.ShowDialog() == DialogResult.OK)
{
    filepath = openFileDialog1.FileName;
    textBox1.Text = filepath;
    load();
}
...............
public void load()
{
    xdoc.Load(filepath);
    root = xdoc.DocumentElement;
    ...............
Error:
An unhandled exception of type 'System.Xml.XmlException' occurred in System.Xml.dll
Additional information: An invalid character for the specified encoding., Line 35, position 16.
That line contains Cyrillic characters (Russian). But if I convert the document to UTF-8 with Notepad++, it loads correctly.

You could use a StreamReader to read the file with the correct encoding and then pass that reader to the XmlDocument.Load overload that accepts a TextReader.
using (var sr = new StreamReader(filepath, myEncoding))
{
    xdoc.Load(sr);
}
You can obtain myEncoding via the Encoding.GetEncoding method.
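Put together, the approach might look like the sketch below. `XmlEncodingLoader` is a hypothetical name, and the choice of encoding is an assumption the caller has to make. For Russian "ANSI" files it is probably Windows-1251, but that depends on the machine that wrote the file.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class XmlEncodingLoader
{
    // Hypothetical helper: decodes the file with the encoding the caller
    // supplies instead of letting the XML parser guess from the bytes.
    public static XmlDocument Load(string path, Encoding encoding)
    {
        var doc = new XmlDocument();
        using (var reader = new StreamReader(path, encoding))
        {
            doc.Load(reader);
        }
        return doc;
    }
}
```

A call for the case in the question could be `XmlEncodingLoader.Load(filepath, Encoding.GetEncoding(1251))`. On .NET Core and .NET 5 or later, code page 1251 is only available after `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` has been called; on .NET Framework it is available by default.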

iTextSharp exception: PDF header signature not found

I'm using iTextSharp to read the contents of PDF documents:
PdfReader reader = new PdfReader(pdfPath);
using (StringWriter output = new StringWriter())
{
    for (int i = 1; i <= reader.NumberOfPages; i++)
        output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
    reader.Close();
    pdfText = output.ToString();
}
99% of the time it works just fine. However, there is this one PDF file that will sometimes throw this exception:
PDF header signature not found. StackTrace:
at iTextSharp.text.pdf.PRTokeniser.CheckPdfHeader()
at iTextSharp.text.pdf.PdfReader.ReadPdf()
at iTextSharp.text.pdf.PdfReader..ctor(String filename, Byte[] ownerPassword)
at Reader.PDF.DownloadPdf(String url) in
What's annoying is that I can't always reproduce the error. Sometimes it works, sometimes it doesn't. Has anyone encountered this problem?

After some research, I've found that this problem relates either to a file being corrupted during PDF generation, or to an object in the document that doesn't conform to the PDF standard as implemented in iTextSharp. It also seems to happen only when you read a PDF file from disk.
I have not found a complete solution to the problem, only a workaround. What I've done is open the PDF document with the iTextSharp PdfReader object and see whether an exception is thrown, before reading the file in normal operation.
So run something similar to this:
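private bool IsValidPdf(string filepath)
{
    bool Ret = true;
    PdfReader reader = null;
    try
    {
        // The constructor parses the file and throws if it cannot be read as a PDF.
        reader = new PdfReader(filepath);
    }
    catch
    {
        Ret = false;
    }
    finally
    {
        // Release the file once the check is done.
        if (reader != null)
        {
            reader.Close();
        }
    }
    return Ret;
}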

I found it was because I was calling new PdfReader(pdf) with the PDF stream position at the end of the file. Setting the position to zero resolved the issue.
Before:
// Throws: InvalidPdfException: PDF header signature not found.
var pdfReader = new PdfReader(pdf);
After:
// Works correctly.
pdf.Position = 0;
var pdfReader = new PdfReader(pdf);
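The effect is easy to reproduce without any PDF library, because any reader that starts at the stream's current position finds nothing left to read. The sketch below uses only the base class library. `StreamRewind`, `TryRewind` and `ReadHead` are hypothetical names for illustration.

```csharp
using System.IO;
using System.Text;

public static class StreamRewind
{
    // Moves a seekable stream back to its start; returns false if it cannot seek.
    public static bool TryRewind(Stream stream)
    {
        if (!stream.CanSeek)
        {
            return false;
        }
        stream.Position = 0;
        return true;
    }

    // Reads up to 'count' bytes from the current position as ASCII text.
    public static string ReadHead(Stream stream, int count)
    {
        byte[] buffer = new byte[count];
        int read = stream.Read(buffer, 0, count);
        return Encoding.ASCII.GetString(buffer, 0, read);
    }
}
```

After writing the bytes of "%PDF-1.4" into a new MemoryStream, `ReadHead(stream, 5)` returns an empty string because the position is at the end. After `TryRewind(stream)`, the same call returns "%PDF-". Streams that report `CanSeek` as false, such as some network streams, need to be copied into a MemoryStream first.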

In my case, it was because I was passing a .json file, and iTextSharp obviously only accepts PDF files.

Another possibility is that the file is open in another method or program, as was my case. Verify that nothing else is using the file; you can also use Resource Monitor to see which processes have it open.
