Text extraction using itext7: garbage characters for some pdf documents

Text extraction using itext7: garbage characters for some pdf documents - c#

I have a problem extracting text from pdf documents using iText7. For documents coming from a specific source textRenderInfo.GetText() returns only garbage chars (0xfdff) in the event handler of my extraction strategy:
internal class CustomExtractionStrategy : ITextExtractionStrategy
{
public virtual void EventOccurred(IEventData data, EventType type)
{
if (!type.Equals((object)EventType.RENDER_TEXT))
{
return;
}
var textRenderInfo = (TextRenderInfo)data;
bool currentResultEmpty = _result.Length == 0;
bool isInNewLine = false;
var baseline = textRenderInfo.GetBaseline();
var startPoint = baseline.GetStartPoint();
var endPoint = baseline.GetEndPoint();
var currentText = textRenderInfo.GetText(); // returns garbage for specific pdfs
// further processing below
...
}
}
I'm not very familiar with the way text/glyph encoding words in PDF but I try to give some details when comparing the problematic pdfs with an example where extraction works. For the pdfs with issues:
textRenderInfo.gs.font is MS-UIGothic
textRenderInfo.gs.font.fontProgram.codeToGlyph contains only mapping (key: 0 to a Glyph with width 1000, unicode -1, code 0)
textRenderInfo.gs.font.fontProgram.unicodeToGlyph contains no records
These are the most obvious discrepancies. If there's any thing else I should look out for please let me know. I would have provided an example of the PDF in question but it might have sensitive information that I must not disclose.
Note: the PDFs can be correctly read in Acrobat Reader and I can copy text from the reader into notepad. Other libraries (pdfium based or ports of PDFBox) can properly extract text from the document. So I think the document as such is "valid".
If this is a known issue for iText7, is there any workaround (other than using a different library altogether)?
Update
With the link provided in the comment and the following code (in addition to the custom extraction strategy snippet shown above) I get garbage chars see VS screenshot:
internal class PdfExtractor
{
internal void ExtractFromPath(string path)
{
PdfReader reader = new PdfReader(path);
var document = new iText.Kernel.Pdf.PdfDocument(reader);
for (int pageNum = 1; pageNum <= document.GetNumberOfPages(); pageNum++)
{
var page = document.GetPage(pageNum);
string text = PdfTextExtractor.GetTextFromPage(page, new CustomExtractionStrategy());
}
}
}

Related

Reading PDF in net core with itext7 returns "\n\n\n\n\n...."

i have a netcore 3 app to read and split a PDF containing paychecks of some companies which i am working for.
This app ran pretty well since last builds... my the way, the PDF reader started to fail to parse the contents of any PDF.
PDF is built only with Italian words, no special chars. Few tables and a single logo. I'm not able to attach it due to privacy.
public PaycheckSplitter Read()
{
using (var reader = new PdfReader(new MemoryStream(this._stream)))
{
var doc = new PdfDocument(reader);
this.Paycheck = new PaychecksCollection();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
PdfPage page = doc.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
if (text.Contains(Consts.BpEnd)) break;
// trying to find something by regex... btw text contains only a sequence of \n\n\n\n...
string cf = Consts.CodFiscale.Match(text).Value;
this.Paychecks.Add(new Paycheck(cf), i);
}
doc.Close();
}
return this;
}
Anything i can do?
As far as i can see... the only and best way to have something to read a PDF text for free is iText7...

Unable to merge 2 PDFs using MemoryStream

I have a c# class that takes an HTML and converts it to PDF using wkhtmltopdf.
As you will see below, I am generating 3 PDFs - Landscape, Portrait, and combined of the two.
The properties object contains the html as a string, and the argument for landscape/portrait.
System.IO.MemoryStream PDF = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file = new System.IO.FileStream("abc_landscape.pdf", System.IO.FileMode.Create);
PDF.Position = 0;
properties.IsHorizontalOrientation = false;
System.IO.MemoryStream PDF_portrait = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file_portrait = new System.IO.FileStream("abc_portrait.pdf", System.IO.FileMode.Create);
PDF_portrait.Position = 0;
System.IO.MemoryStream finalStream = new System.IO.MemoryStream();
PDF.CopyTo(finalStream);
PDF_portrait.CopyTo(finalStream);
System.IO.FileStream file_combined = new System.IO.FileStream("abc_combined.pdf", System.IO.FileMode.Create);
try
{
PDF.WriteTo(file);
PDF.Flush();
PDF_portrait.WriteTo(file_portrait);
PDF_portrait.Flush();
finalStream.WriteTo(file_combined);
finalStream.Flush();
}
catch (Exception)
{
throw;
}
finally
{
PDF.Close();
file.Close();
PDF_portrait.Close();
file_portrait.Close();
finalStream.Close();
file_combined.Close();
}
The PDFs "abc_landscape.pdf" and "abc_portrait.pdf" generate correctly, as expected, but the operation fails when I try to combine the two in a third pdf (abc_combined.pdf).
I am using MemoryStream to preform the merge, and at the time of debug, I can see that the finalStream.length is equal to the sum of the previous two PDFs. But when I try to open the PDF, I see the content of just 1 of the two PDFs.
The same can be seen below:
Additionally, when I try to close the "abc_combined.pdf", I am prompted to save it, which does not happen with the other 2 PDFs.
Below are a few things that I have tried out already, to no avail:
Change CopyTo() to WriteTo()
Merge the same PDF (either Landscape or Portrait one) with itself
In case it is required, below is the elaboration of the GetPdfStream() method.
var htmlStream = new MemoryStream();
var writer = new StreamWriter(htmlStream);
writer.Write(htmlString);
writer.Flush();
htmlStream.Position = 0;
return htmlStream;
Process process = Process.Start(psi);
process.EnableRaisingEvents = true;
try
{
process.Start();
process.BeginErrorReadLine();
var inputTask = Task.Run(() =>
{
htmlStream.CopyTo(process.StandardInput.BaseStream);
process.StandardInput.Close();
});
// Copy the output to a memorystream
MemoryStream pdf = new MemoryStream();
var outputTask = Task.Run(() =>
{
process.StandardOutput.BaseStream.CopyTo(pdf);
});
Task.WaitAll(inputTask, outputTask);
process.WaitForExit();
// Reset memorystream read position
pdf.Position = 0;
return pdf;
}
catch (Exception ex)
{
throw ex;
}
finally
{
process.Dispose();
}

Merging pdf in C# or any other language is not straight forward with out using 3rd party library.
I assume your requirement for not using library is that most Free libraries, nuget packages has limitation or/and cost money for commercial use.
I have made research and found you an Open Source library called PdfClown with nuget package, it is also available for Java. It is Free with out limitation (donate if you like). The library has a lot of features. One such you can merge 2 or more documents to one document.
I supply my example that take a folder with multiple pdf files, merged it and save it to same or another folder. It is also possible to use MemoryStream, but I do not find it necessary in this case.
The code is self explaining, the key point here is using SerializationModeEnum.Incremental:
public static void MergePdf(string srcPath, string destFile)
{
var list = Directory.GetFiles(Path.GetFullPath(srcPath));
if (string.IsNullOrWhiteSpace(srcPath) || string.IsNullOrWhiteSpace(destFile) || list.Length <= 1)
return;
var files = list.Select(File.ReadAllBytes).ToList();
using (var dest = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(files[0])))
{
var document = dest.Document;
var builder = new org.pdfclown.tools.PageManager(document);
foreach (var file in files.Skip(1))
{
using (var src = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(file)))
{ builder.Add(src.Document); }
}
dest.Save(destFile, SerializationModeEnum.Incremental);
}
}
To test it
var srcPath = #"C:\temp\pdf\input";
var destFile = #"c:\temp\pdf\output\merged.pdf";
MergePdf(srcPath, destFile);
Input examples
PDF doc A and PDF doc B
Output example
Links to my research:
https://csharp-source.net/open-source/pdf-libraries
https://sourceforge.net/projects/clown/
https://www.oipapio.com/question-3526089
Disclaimer: A part of this answer is taken from my my personal web site https://itbackyard.com/merge-multiple-pdf-files-to-one-pdf-file-in-c/ with source code to github.

This answer from Stack Overflow (Combine two (or more) PDF's) by Andrew Burns works for me:
using (PdfDocument one = PdfReader.Open("pdf 1.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument two = PdfReader.Open("pdf 2.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument outPdf = new PdfDocument())
{
CopyPages(one, outPdf);
CopyPages(two, outPdf);
outPdf.Save("file1and2.pdf");
}
void CopyPages(PdfDocument from, PdfDocument to)
{
for (int i = 0; i < from.PageCount; i++)
{
to.AddPage(from.Pages[i]);
}
}

That's not quite how PDFs work. PDFs are structured files in a specific format.
You can't just append the bytes of one to the other and expect the result to be a valid document.
You're going to have to use a library that understands the format and can do the operation for you, or developing your own solution.

PDF files aren't just text and images. Behind the scenes there is a strict file format that describes things like PDF version, the objects contained in the file and where to find them.
In order to merge 2 PDFs you'll need to manipulate the streams.
First you'll need to conserve the header from only one of the files. This is pretty easy since it's just the first line.
Then you can write the body of the first page, and then the second.
Now the hard part, and likely the part that will convince you to use a library, is that you have to re-build the xref table. The xref table is a cross reference table that describes the content of the document and more importantly where to find each element. You'd have to calculate the byte offset of the second page, shift all of the elements in it's xref table by that much, and then add it's xref table to the first. You'll also need to ensure you create objects in the xref table for the page break.
Once that's done, you need to re-build the document trailer which tells an application where the various sections of the document are among other things.
See https://resources.infosecinstitute.com/pdf-file-format-basic-structure/
This is not trivial and you'll end up re-writing lots of code that already exists.

Merging Word documents using WordProcessingDocument

I am currently working on a program in which a user should be able to merge several Word documents into one, without losing any formatting, headers and so on. The documents should simply stack up, one after another, without any changes.
Here is my current code:
public virtual Byte[] MergeWordFiles(IEnumerable<SendData> sourceFiles)
{
int f = 0;
// If only one Word document then skip merge.
if (sourceFiles.Count() == 1)
{
return sourceFiles.First().File;
}
else
{
MemoryStream destinationFile = new MemoryStream();
// Add first file
var firstFile = sourceFiles.First().File;
destinationFile.Write(firstFile, 0, firstFile.Length);
destinationFile.Position = 0;
int pointer = 1;
byte[] ret;
// Add the rest of the files
try
{
using (WordprocessingDocument mainDocument = WordprocessingDocument.Open(destinationFile, true))
{
XElement newBody = XElement.Parse(mainDocument.MainDocumentPart.Document.Body.OuterXml);
for (pointer = 1; pointer < sourceFiles.Count(); pointer++)
{
WordprocessingDocument tempDocument = WordprocessingDocument.Open(new MemoryStream(sourceFiles.ElementAt(pointer).File), true);
XElement tempBody = XElement.Parse(tempDocument.MainDocumentPart.Document.Body.OuterXml);
newBody.Add(XElement.Parse(new DocumentFormat.OpenXml.Wordprocessing.Paragraph(new Run(new Break { Type = BreakValues.Page })).OuterXml));
newBody.Add(tempBody);
mainDocument.MainDocumentPart.Document.Body = new Body(newBody.ToString());
mainDocument.MainDocumentPart.Document.Save();
mainDocument.Package.Flush();
}
}
}
catch (OpenXmlPackageException oxmle)
{
throw new Exception(string.Format(CultureInfo.CurrentCulture, "Error while merging files. Document index {0}", pointer), oxmle);
}
catch (Exception e)
{
throw new Exception(string.Format(CultureInfo.CurrentCulture, "Error while merging files. Document index {0}", pointer), e);
}
finally
{
ret = destinationFile.ToArray();
destinationFile.Close();
destinationFile.Dispose();
}
return ret;
}
}
The problem here is that the formatting is copied from the first document and applied to all the rest, meaning that for instance a different header in the second document will be ignored. How do I prevent this?
I have been looking in to breaking the document in to sections using SectionMarkValues.NextPage, as well as using altChunk.
The problem with the latter is altChunk does not seem to be able to handle a MemoryStream into its "FeedData" method.

DocIO is a .NET library that can read, write, merge and render Word 2003/2007/2010/2013/2016 files. The whole suite of controls is available for free (commercial applications also) through the community license program if you qualify. The community license is the full product with no limitations or watermarks.
Step 1: Create a console application
Step 2: Add reference to Syncfusion.DocIO.Base, Syncfusion.Compression.Base and Syncfusion.OfficeChart.Base; You can add these reference to your project using NuGet also.
Step 3: Copy & paste the following code snippet.
This code snippet will produce the document as per your requirement; each input Word document will get merged with its original formatting, styles and headers/footer.
using Syncfusion.DocIO.DLS;
using Syncfusion.DocIO;
using System.IO;
namespace DocIO_MergeDocument
{
class Program
{
static void Main(string[] args)
{
//Boolean to indicate whether any of the input document has different odd and even headers as true
bool isDifferentOddAndEvenPagesEnabled = false;
// Creating a new document.
using (WordDocument mergedDocument = new WordDocument())
{
//Get the files from input directory
DirectoryInfo dirInfo = new DirectoryInfo(System.Environment.CurrentDirectory + #"\..\..\Data");
FileInfo[] fileInfo = dirInfo.GetFiles();
for (int i = 0; i < fileInfo.Length; i++)
{
if (fileInfo[i].Extension == ".doc" || fileInfo[i].Extension == ".docx")
{
using (WordDocument sourceDocument = new WordDocument(fileInfo[i].FullName))
{
//Check whether the document has different odd and even header/footer
if (!isDifferentOddAndEvenPagesEnabled)
{
foreach (WSection section in sourceDocument.Sections)
{
isDifferentOddAndEvenPagesEnabled = section.PageSetup.DifferentOddAndEvenPages;
if (isDifferentOddAndEvenPagesEnabled)
break;
}
}
//Sets the breakcode of First section of source document as NoBreak to avoid imported from a new page
sourceDocument.Sections[0].BreakCode = SectionBreakCode.EvenPage;
//Imports the contents of source document at the end of merged document
mergedDocument.ImportContent(sourceDocument, ImportOptions.KeepSourceFormatting);
}
}
}
//if any of the input document has different odd and even headers as true then
//Copy the content of the odd header/foort and add the copied content into the even header/footer
if (isDifferentOddAndEvenPagesEnabled)
{
foreach (WSection section in mergedDocument.Sections)
{
section.PageSetup.DifferentOddAndEvenPages = true;
if (section.HeadersFooters.OddHeader.Count > 0 && section.HeadersFooters.EvenHeader.Count == 0)
{
for (int i = 0; i < section.HeadersFooters.OddHeader.Count; i++)
section.HeadersFooters.EvenHeader.ChildEntities.Add(section.HeadersFooters.OddHeader.ChildEntities[i].Clone());
}
if (section.HeadersFooters.OddFooter.Count > 0 && section.HeadersFooters.EvenFooter.Count == 0)
{
for (int i = 0; i < section.HeadersFooters.OddFooter.Count; i++)
section.HeadersFooters.EvenFooter.ChildEntities.Add(section.HeadersFooters.OddFooter.ChildEntities[i].Clone());
}
}
}
//If there is no document to merge then add empty section with empty paragraph
if (mergedDocument.Sections.Count == 0)
mergedDocument.EnsureMinimal();
//Saves the document in the given name and format
mergedDocument.Save("result.docx", FormatType.Docx);
}
}
}
}
Downloadable Demo
Note: There is a Word document (not section) level settings for
applying different header/footer for odd and even pages. Each input
document can have different values for this property. if any of the
input document has different odd and even header/footer as true, it
will affect the visual appearance of header/footer in the resultant
document. Hence, if any of the input document has different odd and
even header/footer, then the resultant Word document will have been
replaced with the odd header/footer contents.
For further information about DocIO, please refer our help documentation
Note: I work for Syncfusion

getting URLs from PDF using PDFNet

We're using the PDFNet library to extract the contents of a PDF file. One of the things we need to do is extract the URLs in the PDF. Unfortunately, as you scan through the elements in the file, you get the URL in pieces, and it's not always clear which piece goes with which.
What is the best way to get complete URLs from PDFNet?

Links are stored on the pages as annotations. You can do something like the following code to get the URI from the annotation. The try/catch block is there because if any of the values are missing, they still return an Obj object, but you cannot call any method on it without it throwing.
Also, be aware that not everything that looks like a link is the same. We created two PDFs from the same Word file. The first we created with print to PDF. The second we created from within Acrobat.
The links in both files work fine with Acrobat Reader, but only the second file has annotations that PDFNet can see.
Page page = doc.GetPage(1);
for (int i = 1; j < page.GetNumAnnots(); j++) {
Annot annot = page.GetAnnot(i);
if (!annot.IsValid())
continue;
var sdf = annot.GetSDFObj();
string uri = ParseURI(sdf);
Console.WriteLine(uri);
}
private string ParseURI(pdftron.SDF.Obj obj) {
try {
if (obj.IsDict()) {
var aDictionary = obj.Find("A").Value();
var uri = aDictionary.Find("URI").Value();
return uri.GetAsPDFText();
}
} catch (Exception ) {
return null;
}
return null;
}

How do I extract attachments from a pdf file?

I have a big number pdf documents with xml files attached to them. I would like to extract those attached xml files and read them. How can I do this programatically using .net?

iTextSharp is also quite capable of extracting attachments... Though you might have to use the low level objects to do so.
There are two ways to embed files in a PDF:
In a File Annotation
At the document level "EmbeddedFiles".
Once you have a file specification dictionary from either source, the file itself will be a stream within the dictionary labeled "EF" (embedded file).
So to list all the files at the document level, one would write code (in Java) as such:
Map<String, byte[]> files = new HashMap<String,byte[]>();
PdfReader reader = new PdfReader(pdfPath);
PdfDictionary root = reader.getCatalog();
PdfDictionary names = root.getAsDict(PdfName.NAMES); // may be null
PdfDictionary embeddedFilesDict = names.getAsDict(PdfName.EMBEDDEDFILES); //may be null
PdfArray embeddedFiles = embeddedFilesDict.getAsArray(PdfName.NAMES); // may be null
int len = embeddedFiles.size();
for (int i = 0; i < len; i += 2) {
PdfString name = embeddedFiles.getAsString(i); // should always be present
PdfDictionary fileSpec = embeddedFiles.getAsDict(i+1); // ditto
PdfDictionary streams = fileSpec.getAsDict(PdfName.EF);
PRStream stream = null;
if (streams.contains(PdfName.UF))
stream = (PRStream)streams.getAsStream(PdfName.UF);
else
stream = (PRStream)streams.getAsStream(PdfName.F); // Default stream for backwards compatibility
if (stream != null) {
files.put( name.toUnicodeString(), PdfReader.getStreamBytes((PRStream)stream));
}
}

This is an old question, nonetheless I think my alternative solution (using PDF Clown) may be of some interest as it's way much cleaner (and more complete, as it iterates both at document and page level) than the code fragments previously proposed:
using org.pdfclown.bytes;
using org.pdfclown.documents;
using org.pdfclown.documents.files;
using org.pdfclown.documents.interaction.annotations;
using org.pdfclown.objects;
using System;
using System.Collections.Generic;
void ExtractAttachments(string pdfPath)
{
Dictionary<string, byte[]> attachments = new Dictionary<string, byte[]>();
using(org.pdfclown.files.File file = new org.pdfclown.files.File(pdfPath))
{
Document document = file.Document;
// 1. Embedded files (document level).
foreach(KeyValuePair<PdfString,FileSpecification> entry in document.Names.EmbeddedFiles)
{EvaluateDataFile(attachments, entry.Value);}
// 2. File attachments (page level).
foreach(Page page in document.Pages)
{
foreach(Annotation annotation in page.Annotations)
{
if(annotation is FileAttachment)
{EvaluateDataFile(attachments, ((FileAttachment)annotation).DataFile);}
}
}
}
}
void EvaluateDataFile(Dictionary<string, byte[]> attachments, FileSpecification dataFile)
{
if(dataFile is FullFileSpecification)
{
EmbeddedFile embeddedFile = ((FullFileSpecification)dataFile).EmbeddedFile;
if(embeddedFile != null)
{attachments[dataFile.Path] = embeddedFile.Data.ToByteArray();}
}
}
Note that you don't have to bother with null pointer exceptions as PDF Clown provides all the necessary abstraction and automation to ensure smooth model traversal.
PDF Clown is an LGPL 3 library, implemented both in Java and .NET platforms (I'm its lead developer): if you want to get it a try, I suggest you to check out its SVN repository on sourceforge.net as it keeps evolving.

Look for ABCpdf-Library, very easy and fast in my opinion.

What I got working is slightly different then anything else I have seen online.
So, just in case, I thought I would post this here to help someone else. I had to go through many different iterations to figure out - the hard way - what I needed to get it to work.
I am merging two PDFs into a third PDF, where one of the first two PDFs may have file attachments that need to be carried over into the third PDF. I am working completely in streams with ASP.NET, C# 4.0, ITextSharp 5.1.2.0.
// Extract Files from Submit PDF
Dictionary<string, byte[]> files = new Dictionary<string, byte[]>();
PdfDictionary names;
PdfDictionary embeddedFiles;
PdfArray fileSpecs;
int eFLength = 0;
names = writeReader.Catalog.GetAsDict(PdfName.NAMES); // may be null, writeReader is the PdfReader for a PDF input stream
if (names != null)
{
embeddedFiles = names.GetAsDict(PdfName.EMBEDDEDFILES); //may be null
if (embeddedFiles != null)
{
fileSpecs = embeddedFiles.GetAsArray(PdfName.NAMES); //may be null
if (fileSpecs != null)
{
eFLength = fileSpecs.Size;
for (int i = 0; i < eFLength; i++)
{
i++; //objects are in pairs and only want odd objects (1,3,5...)
PdfDictionary fileSpec = fileSpecs.GetAsDict(i); // may be null
if (fileSpec != null)
{
PdfDictionary refs = fileSpec.GetAsDict(PdfName.EF);
foreach (PdfName key in refs.Keys)
{
PRStream stream = (PRStream)PdfReader.GetPdfObject(refs.GetAsIndirectObject(key));
if (stream != null)
{
files.Add(fileSpec.GetAsString(key).ToString(), PdfReader.GetStreamBytes(stream));
}
}
}
}
}
}
}

You may try Aspose.Pdf.Kit for .NET. The PdfExtractor class allows you to extract attachments with the help of two methods: ExtractAttachment and GetAttachment. Please see an example of attachment extraction.
Disclosure: I work as developer evangelist at Aspose.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Text extraction using itext7: garbage characters for some pdf documents - c#

Related

Reading PDF in net core with itext7 returns "\n\n\n\n\n...."

Unable to merge 2 PDFs using MemoryStream

Merging Word documents using WordProcessingDocument

getting URLs from PDF using PDFNet

How do I extract attachments from a pdf file?

Categories

Resources