We're using the PDFNet library to extract the contents of a PDF file. One of the things we need to do is extract the URLs in the PDF. Unfortunately, as you scan through the elements in the file, you get the URL in pieces, and it's not always clear which piece goes with which.
What is the best way to get complete URLs from PDFNet?
Links are stored on the pages as annotations. You can use something like the following code to get the URI from an annotation. The try/catch block is there because if any of the values are missing, the lookup still returns an Obj object, but calling any method on it will throw.
Also, be aware that not everything that looks like a link is the same. We created two PDFs from the same Word file. The first we created with print to PDF. The second we created from within Acrobat.
The links in both files work fine with Acrobat Reader, but only the second file has annotations that PDFNet can see.
Page page = doc.GetPage(1);

// Annotation indices in PDFNet are zero-based.
for (int i = 0; i < page.GetNumAnnots(); i++) {
    Annot annot = page.GetAnnot(i);
    if (!annot.IsValid())
        continue;

    var sdf = annot.GetSDFObj();
    string uri = ParseURI(sdf);
    Console.WriteLine(uri);
}
private string ParseURI(pdftron.SDF.Obj obj) {
    try {
        if (obj.IsDict()) {
            // The "A" (action) entry holds the link action; its "URI" entry holds the target.
            var aDictionary = obj.Find("A").Value();
            var uri = aDictionary.Find("URI").Value();
            return uri.GetAsPDFText();
        }
    } catch (Exception) {
        return null;
    }
    return null;
}
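If you want every link in the document rather than just those on page 1, a minimal sketch of a wrapper looks like the following; it only uses the calls shown above plus PDFDoc.GetPageCount(), and assumes doc is an open PDFDoc:

// Sketch: apply the snippet above to every page and print each URI found.
for (int pageNum = 1; pageNum <= doc.GetPageCount(); pageNum++) {
    Page page = doc.GetPage(pageNum);
    for (int i = 0; i < page.GetNumAnnots(); i++) {
        Annot annot = page.GetAnnot(i);
        if (!annot.IsValid())
            continue;

        string uri = ParseURI(annot.GetSDFObj());
        if (uri != null)
            Console.WriteLine(uri);
    }
}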
Related
I have a problem extracting text from PDF documents using iText7. For documents coming from a specific source, textRenderInfo.GetText() returns only garbage chars (0xfdff) in the event handler of my extraction strategy:
internal class CustomExtractionStrategy : ITextExtractionStrategy
{
    public virtual void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals((object)EventType.RENDER_TEXT))
        {
            return;
        }

        var textRenderInfo = (TextRenderInfo)data;
        bool currentResultEmpty = _result.Length == 0;
        bool isInNewLine = false;
        var baseline = textRenderInfo.GetBaseline();
        var startPoint = baseline.GetStartPoint();
        var endPoint = baseline.GetEndPoint();
        var currentText = textRenderInfo.GetText(); // returns garbage for specific pdfs

        // further processing below
        ...
    }
}
I'm not very familiar with the way text/glyph encoding works in PDF, but I'll try to give some details from comparing the problematic PDFs with an example where extraction works. For the PDFs with issues:
textRenderInfo.gs.font is MS-UIGothic
textRenderInfo.gs.font.fontProgram.codeToGlyph contains only mapping (key: 0 to a Glyph with width 1000, unicode -1, code 0)
textRenderInfo.gs.font.fontProgram.unicodeToGlyph contains no records
These are the most obvious discrepancies. If there's anything else I should look out for, please let me know. I would have provided an example of the PDF in question, but it might have sensitive information that I must not disclose.
Note: the PDFs can be correctly read in Acrobat Reader and I can copy text from the reader into notepad. Other libraries (pdfium based or ports of PDFBox) can properly extract text from the document. So I think the document as such is "valid".
If this is a known issue for iText7, is there any workaround (other than using a different library altogether)?
Update
With the link provided in the comment and the following code (in addition to the custom extraction strategy snippet shown above) I get garbage chars (see the VS screenshot):
internal class PdfExtractor
{
    internal void ExtractFromPath(string path)
    {
        PdfReader reader = new PdfReader(path);
        var document = new iText.Kernel.Pdf.PdfDocument(reader);
        for (int pageNum = 1; pageNum <= document.GetNumberOfPages(); pageNum++)
        {
            var page = document.GetPage(pageNum);
            string text = PdfTextExtractor.GetTextFromPage(page, new CustomExtractionStrategy());
        }
    }
}
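One diagnostic that might be worth running on the problem files (a sketch based on the iText 7 kernel API; the exact traversal is an assumption, not something from the post above) is to check whether the fonts on each page carry a ToUnicode CMap, since without one the extractor has no way to map glyph codes back to Unicode:

// Sketch: report, for every page, which fonts have a ToUnicode CMap.
// Assumes `document` is an open iText.Kernel.Pdf.PdfDocument as in the snippet above.
for (int pageNum = 1; pageNum <= document.GetNumberOfPages(); pageNum++)
{
    PdfDictionary fonts = document.GetPage(pageNum).GetResources().GetResource(PdfName.Font);
    if (fonts == null)
        continue;

    foreach (PdfName fontKey in fonts.KeySet())
    {
        PdfDictionary fontDict = fonts.GetAsDictionary(fontKey);
        bool hasToUnicode = fontDict != null && fontDict.ContainsKey(PdfName.ToUnicode);
        Console.WriteLine($"Page {pageNum}, font {fontKey}: ToUnicode present = {hasToUnicode}");
    }
}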
I have a .NET Core 3 app that reads and splits a PDF containing paychecks for some companies I'm working for.
This app had been running fine until the last few builds; now the PDF reader fails to parse the contents of any PDF.
The PDF contains only Italian words, no special characters, a few tables, and a single logo. I'm not able to attach it due to privacy.
public PaycheckSplitter Read()
{
    using (var reader = new PdfReader(new MemoryStream(this._stream)))
    {
        var doc = new PdfDocument(reader);
        this.Paychecks = new PaychecksCollection();
        for (int i = 1; i <= doc.GetNumberOfPages(); i++)
        {
            PdfPage page = doc.GetPage(i);
            string text = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
            if (text.Contains(Consts.BpEnd)) break;
            // trying to find something by regex... btw text contains only a sequence of \n\n\n\n...
            string cf = Consts.CodFiscale.Match(text).Value;
            this.Paychecks.Add(new Paycheck(cf), i);
        }
        doc.Close();
    }
    return this;
}
Anything I can do?
As far as I can see, the only decent free way to read PDF text is iText7...
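As a quick sanity check (just a sketch; SimpleTextExtractionStrategy is the other strategy that ships with the iText7 parser), it may be worth comparing the two built-in strategies on the same page to see whether the empty output is specific to LocationTextExtractionStrategy:

// Sketch: compare the two built-in iText7 strategies on one page.
// Assumes `doc` is the open PdfDocument from the Read() method above.
PdfPage page = doc.GetPage(1);
string locationText = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
string simpleText = PdfTextExtractor.GetTextFromPage(page, new SimpleTextExtractionStrategy());
Console.WriteLine($"Location strategy: {locationText.Length} chars, Simple strategy: {simpleText.Length} chars");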
I have a c# class that takes an HTML and converts it to PDF using wkhtmltopdf.
As you will see below, I am generating 3 PDFs: landscape, portrait, and a combination of the two.
The properties object contains the html as a string, and the argument for landscape/portrait.
System.IO.MemoryStream PDF = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file = new System.IO.FileStream("abc_landscape.pdf", System.IO.FileMode.Create);
PDF.Position = 0;

properties.IsHorizontalOrientation = false;
System.IO.MemoryStream PDF_portrait = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file_portrait = new System.IO.FileStream("abc_portrait.pdf", System.IO.FileMode.Create);
PDF_portrait.Position = 0;

System.IO.MemoryStream finalStream = new System.IO.MemoryStream();
PDF.CopyTo(finalStream);
PDF_portrait.CopyTo(finalStream);
System.IO.FileStream file_combined = new System.IO.FileStream("abc_combined.pdf", System.IO.FileMode.Create);

try
{
    PDF.WriteTo(file);
    PDF.Flush();
    PDF_portrait.WriteTo(file_portrait);
    PDF_portrait.Flush();
    finalStream.WriteTo(file_combined);
    finalStream.Flush();
}
catch (Exception)
{
    throw;
}
finally
{
    PDF.Close();
    file.Close();
    PDF_portrait.Close();
    file_portrait.Close();
    finalStream.Close();
    file_combined.Close();
}
The PDFs "abc_landscape.pdf" and "abc_portrait.pdf" generate correctly, as expected, but the operation fails when I try to combine the two in a third pdf (abc_combined.pdf).
I am using MemoryStream to preform the merge, and at the time of debug, I can see that the finalStream.length is equal to the sum of the previous two PDFs. But when I try to open the PDF, I see the content of just 1 of the two PDFs.
Additionally, when I try to close the "abc_combined.pdf", I am prompted to save it, which does not happen with the other 2 PDFs.
Below are a few things that I have tried out already, to no avail:
Change CopyTo() to WriteTo()
Merge the same PDF (either Landscape or Portrait one) with itself
In case it is required, below is the elaboration of the GetPdfStream() method.
var htmlStream = new MemoryStream();
var writer = new StreamWriter(htmlStream);
writer.Write(htmlString);
writer.Flush();
htmlStream.Position = 0;
return htmlStream;

Process process = Process.Start(psi);
process.EnableRaisingEvents = true;
try
{
    process.Start();
    process.BeginErrorReadLine();

    var inputTask = Task.Run(() =>
    {
        htmlStream.CopyTo(process.StandardInput.BaseStream);
        process.StandardInput.Close();
    });

    // Copy the output to a memorystream
    MemoryStream pdf = new MemoryStream();
    var outputTask = Task.Run(() =>
    {
        process.StandardOutput.BaseStream.CopyTo(pdf);
    });

    Task.WaitAll(inputTask, outputTask);
    process.WaitForExit();

    // Reset memorystream read position
    pdf.Position = 0;
    return pdf;
}
catch (Exception ex)
{
    throw ex;
}
finally
{
    process.Dispose();
}
Merging PDFs in C# (or any other language) is not straightforward without using a third-party library.
I assume your reason for not using a library is that most free libraries and NuGet packages have limitations and/or cost money for commercial use.
I did some research and found an open-source library called PdfClown, available as a NuGet package and also for Java. It is free without limitation (donate if you like). The library has a lot of features; one of them is merging two or more documents into one.
Below is an example that takes a folder with multiple PDF files, merges them, and saves the result to the same or another folder. It is also possible to use a MemoryStream, but I don't find it necessary in this case.
The code is self-explanatory; the key point here is using SerializationModeEnum.Incremental:
public static void MergePdf(string srcPath, string destFile)
{
    var list = Directory.GetFiles(Path.GetFullPath(srcPath));
    if (string.IsNullOrWhiteSpace(srcPath) || string.IsNullOrWhiteSpace(destFile) || list.Length <= 1)
        return;

    var files = list.Select(File.ReadAllBytes).ToList();
    using (var dest = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(files[0])))
    {
        var document = dest.Document;
        var builder = new org.pdfclown.tools.PageManager(document);
        foreach (var file in files.Skip(1))
        {
            using (var src = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(file)))
            {
                builder.Add(src.Document);
            }
        }
        dest.Save(destFile, SerializationModeEnum.Incremental);
    }
}
To test it:
var srcPath = @"C:\temp\pdf\input";
var destFile = @"c:\temp\pdf\output\merged.pdf";
MergePdf(srcPath, destFile);
Input examples: PDF doc A and PDF doc B. Output example: the merged PDF.
Links to my research:
https://csharp-source.net/open-source/pdf-libraries
https://sourceforge.net/projects/clown/
https://www.oipapio.com/question-3526089
Disclaimer: A part of this answer is taken from my personal web site https://itbackyard.com/merge-multiple-pdf-files-to-one-pdf-file-in-c/ with source code on GitHub.
This answer from Stack Overflow (Combine two (or more) PDF's) by Andrew Burns works for me:
using (PdfDocument one = PdfReader.Open("pdf 1.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument two = PdfReader.Open("pdf 2.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument outPdf = new PdfDocument())
{
    CopyPages(one, outPdf);
    CopyPages(two, outPdf);

    outPdf.Save("file1and2.pdf");
}

void CopyPages(PdfDocument from, PdfDocument to)
{
    for (int i = 0; i < from.PageCount; i++)
    {
        to.AddPage(from.Pages[i]);
    }
}
That's not quite how PDFs work. PDFs are structured files in a specific format.
You can't just append the bytes of one to the other and expect the result to be a valid document.
You're going to have to use a library that understands the format and can do the operation for you, or develop your own solution.
PDF files aren't just text and images. Behind the scenes there is a strict file format that describes things like PDF version, the objects contained in the file and where to find them.
In order to merge 2 PDFs you'll need to manipulate the streams.
First you'll need to conserve the header from only one of the files. This is pretty easy since it's just the first line.
Then you can write the body of the first document, and then the second.
Now the hard part, and likely the part that will convince you to use a library, is that you have to re-build the xref table. The xref table is a cross-reference table that describes the content of the document and, more importantly, where to find each element. You'd have to calculate the byte offset of the second document, shift all of the elements in its xref table by that much, and then add its xref table to the first. You'll also need to ensure you create objects in the xref table for the page break.
Once that's done, you need to re-build the document trailer which tells an application where the various sections of the document are among other things.
See https://resources.infosecinstitute.com/pdf-file-format-basic-structure/
This is not trivial and you'll end up re-writing lots of code that already exists.
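To make the structural point concrete, here is a small illustrative sketch (not a merger, and only an assumption about well-formed files): it reads the byte offset recorded after the startxref keyword near the end of a PDF. After a naive byte concatenation, that offset and every entry in the xref table still point into the first file, which is why a viewer shows only one of the documents.

// Illustrative sketch only: read the byte offset that follows the "startxref"
// keyword in a PDF's trailer. In a naively concatenated file this offset no
// longer matches the bytes of the combined file.
static long ReadStartXrefOffset(string path)
{
    byte[] bytes = System.IO.File.ReadAllBytes(path);

    // The last ~1 KB of a conforming PDF contains "startxref", the offset of the
    // last cross-reference section, and "%%EOF".
    int tailLength = Math.Min(1024, bytes.Length);
    string tail = System.Text.Encoding.ASCII.GetString(bytes, bytes.Length - tailLength, tailLength);

    // Take the last "startxref <digits>" pair in the tail.
    var match = System.Text.RegularExpressions.Regex.Match(
        tail, @"startxref\s+(\d+)", System.Text.RegularExpressions.RegexOptions.RightToLeft);
    if (!match.Success)
        throw new InvalidOperationException("No startxref entry found; not a well-formed PDF trailer.");

    return long.Parse(match.Groups[1].Value);
}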
As per question Remove unused image objects
I was told I'd effectively have to parse a PDF file, take note of the global object names, then remove those not in use.
I would not have even an inkling of where to start.
I was having a look in the VS2010 Locals viewer and could see that a page had an array called Matrix, which seems to contain the XObjects in use on the page. But Matrix does not seem to be a property that the API exposes.
I also found in my reader an xrefObj array, which seems to contain every object. When looking at the XObjects I found a number of PRStream objects whose sizes corresponded to the actual images.
iTextSharp.text.pdf.PdfDictionary dictionary = reader.GetPageN(i);
iTextSharp.text.pdf.PdfImportedPage page = pdfCpy.GetImportedPage(reader, i);
iTextSharp.text.pdf.PdfDictionary res = (iTextSharp.text.pdf.PdfDictionary)iTextSharp.text.pdf.PdfReader.GetPdfObject(dictionary.Get(iTextSharp.text.pdf.PdfName.RESOURCES));
iTextSharp.text.pdf.PdfDictionary xobj = (iTextSharp.text.pdf.PdfDictionary)iTextSharp.text.pdf.PdfReader.GetPdfObject(res.Get(iTextSharp.text.pdf.PdfName.XOBJECT));

foreach (iTextSharp.text.pdf.PdfName name in xobj.Keys)
{
    iTextSharp.text.pdf.PdfObject obj = xobj.Get(name);
    if (obj.IsIndirect())
    {
        iTextSharp.text.pdf.PdfDictionary tg = (iTextSharp.text.pdf.PdfDictionary)iTextSharp.text.pdf.PdfReader.GetPdfObject(obj);
        iTextSharp.text.pdf.PdfName type = (iTextSharp.text.pdf.PdfName)iTextSharp.text.pdf.PdfReader.GetPdfObject(tg.Get(iTextSharp.text.pdf.PdfName.SUBTYPE));
        if (iTextSharp.text.pdf.PdfName.IMAGE.Equals(type))
        {
            int XrefIndex = Convert.ToInt32(((iTextSharp.text.pdf.PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
            iTextSharp.text.pdf.PdfObject pdfObj = reader.GetPdfObject(XrefIndex);
            iTextSharp.text.pdf.PdfStream pdfStream = (iTextSharp.text.pdf.PdfStream)pdfObj;
        }
    }
}
This block seems to give me the entire catalog of resources, as opposed to the ones actually used on the page.
So I guess what I'm asking for is:
- How can I match what is actually used in my PDF file (I assume I make a list of all the ObjNum references on each page of my file) against the master list that is held in the reader?
- How can I remove all objects that are not in my reference list and save in place (this is a temporary file, so in place would be fine)?
Thanks in advance.
So to identify the images on the page, I used a PdfReaderContentParser.
iTextSharp.text.pdf.parser.PdfReaderContentParser parser = new iTextSharp.text.pdf.parser.PdfReaderContentParser(reader);
MyImageRenderListener listener = new MyImageRenderListener();
while (i < numberofPages)
{
    i++;
    parser.ProcessContent(i, listener);
}
MyImageRenderListener is a new class implementing iTextSharp.text.pdf.parser.IRenderListener.
All I did was add all the object numbers to a list that's accessible from the original class:
public void RenderImage(iTextSharp.text.pdf.parser.ImageRenderInfo renderInfo)
{
    iTextSharp.text.pdf.parser.PdfImageObject image = renderInfo.GetImage();
    if (image == null) return;
    ImageNames.Add(renderInfo.GetRef().Number);
}
Then I used the code posted in the original question as the basis of the master image list.
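For the removal step itself (not shown above), one possible approach, sketched here as an untested assumption rather than a verified recipe, is to drop from each page's XObject dictionary the image entries whose indirect object numbers were never reported by the listener, and then let iTextSharp's PdfReader.RemoveUnusedObjects() prune the orphaned objects before saving with a PdfStamper:

// Sketch only: remove image XObjects whose object numbers never appeared in
// the listener's list, then prune unused objects. `reader` is the open PdfReader
// and `listener.ImageNames` is the list filled in RenderImage above.
var usedImageNumbers = new HashSet<int>(listener.ImageNames);

for (int pageNum = 1; pageNum <= reader.NumberOfPages; pageNum++)
{
    iTextSharp.text.pdf.PdfDictionary pageDict = reader.GetPageN(pageNum);
    var res = (iTextSharp.text.pdf.PdfDictionary)iTextSharp.text.pdf.PdfReader.GetPdfObject(pageDict.Get(iTextSharp.text.pdf.PdfName.RESOURCES));
    var xobj = (iTextSharp.text.pdf.PdfDictionary)iTextSharp.text.pdf.PdfReader.GetPdfObject(res.Get(iTextSharp.text.pdf.PdfName.XOBJECT));
    if (xobj == null)
        continue;

    foreach (var name in new List<iTextSharp.text.pdf.PdfName>(xobj.Keys))
    {
        var obj = xobj.Get(name);
        if (!obj.IsIndirect())
            continue;

        var tg = (iTextSharp.text.pdf.PdfDictionary)iTextSharp.text.pdf.PdfReader.GetPdfObject(obj);
        var type = (iTextSharp.text.pdf.PdfName)iTextSharp.text.pdf.PdfReader.GetPdfObject(tg.Get(iTextSharp.text.pdf.PdfName.SUBTYPE));
        if (!iTextSharp.text.pdf.PdfName.IMAGE.Equals(type))
            continue; // only touch image XObjects, leave form XObjects alone

        if (!usedImageNumbers.Contains(((iTextSharp.text.pdf.PRIndirectReference)obj).Number))
        {
            xobj.Remove(name); // this image was never rendered on any page
        }
    }
}

reader.RemoveUnusedObjects();
using (var output = new System.IO.FileStream("cleaned.pdf", System.IO.FileMode.Create))
using (var stamper = new iTextSharp.text.pdf.PdfStamper(reader, output))
{
    // PdfStamper writes the modified document when it is disposed.
}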
I am trying to embed PDF files into an Open XML document. This requires creating *.bin files. I don't want to use automation.
The approach I've taken from this question works for all file types I've tested except *.pdf.
For some reason, with PDF files the result from OleCreateFromFile(..) is always 0x80004005 and pOle is NULL.
I am new to P/Invoke and OLE. What could be the reason this approach does not work for PDF?
(I have the newest Adobe Reader, Win8, I'm invoking into Ole32.dll, the project's build target is x86, I've tried calling CoUninitialize() and CoInitializeEx((System.IntPtr)null, OLE32.CoInit.ApartmentThreaded), and I am able to embed PDF files in the MS Word application.)
Here is a function that I use for it:
public static string ExportOleFile(string _inputFileName, string oleOutputFileName, string emfOutputFileName)
{
StringBuilder resultString = new StringBuilder();
string newInput = MultibyteToUnicodeNETOnly(_inputFileName, 1252);
Microsoft.VisualStudio.OLE.Interop.IStorage storage;
var result = OLE32.StgCreateStorageEx(oleOutputFileName,
Convert.ToInt32(OLE32.STGM.STGM_READWRITE | OLE32.STGM.STGM_SHARE_EXCLUSIVE | OLE32.STGM.STGM_CREATE | OLE32.STGM.STGM_TRANSACTED),
Convert.ToInt32(OLE32.STGFMT.STGFMT_DOCFILE),
0,
IntPtr.Zero,
IntPtr.Zero,
ref OLE32.IID_IStorage,
out storage
);//vytvoří bin
resultString.AppendLine("CreateStorageEx Result: " + result.ToString());
var CLSID_NULL = Guid.Empty;
Microsoft.VisualStudio.OLE.Interop.FORMATETC f = new FORMATETC();
Microsoft.VisualStudio.OLE.Interop.IOleObject pOle;
result = OLE32.OleCreateFromFile(
ref CLSID_NULL,
newInput,
ref OLE32.IID_IOleObject,
(uint)Microsoft.VisualStudio.OLE.Interop.OLERENDER.OLERENDER_NONE,
ref f,
null,
storage,
out pOle
);
resultString.AppendLine("OleCreateFromFile Result: " + result.ToString());
try
{
result = OLE32.OleRun(pOle);
}
catch (Exception ex)
{
resultString.AppendLine(ex.ToString());
return resultString.ToString();
}
resultString.AppendLine("OleRun Result: " + result.ToString());
try
{
IntPtr unknownFromOle = Marshal.GetIUnknownForObject(pOle);
IntPtr unknownForDataObj;
Marshal.QueryInterface(unknownFromOle, ref OLE32.IID_IDataObject, out unknownForDataObj);
var pdo = Marshal.GetObjectForIUnknown(unknownForDataObj) as System.Runtime.InteropServices.ComTypes.IDataObject;
var fetc = new System.Runtime.InteropServices.ComTypes.FORMATETC();
fetc.cfFormat = (short)OLE32.CLIPFORMAT.CF_ENHMETAFILE;
fetc.dwAspect = System.Runtime.InteropServices.ComTypes.DVASPECT.DVASPECT_CONTENT;
fetc.lindex = -1;
fetc.ptd = IntPtr.Zero;
fetc.tymed = System.Runtime.InteropServices.ComTypes.TYMED.TYMED_ENHMF;
var stgm = new System.Runtime.InteropServices.ComTypes.STGMEDIUM();
stgm.unionmember = IntPtr.Zero;
stgm.tymed = System.Runtime.InteropServices.ComTypes.TYMED.TYMED_ENHMF;
pdo.GetData(ref fetc, out stgm);
var hemf = GDI32.CopyEnhMetaFile(stgm.unionmember, emfOutputFileName);
storage.Commit((int)OLE32.STGC.DEFAULT);
pOle.Close(0);
GDI32.DeleteEnhMetaFile(stgm.unionmember);
GDI32.DeleteEnhMetaFile(hemf);
}
catch (Exception ex)
{
resultString.AppendLine(ex.ToString());
return resultString.ToString();
}
return resultString.ToString();
}
Actually, for embedding files in OpenXML it is necessary to work with the good old OLE functions. There is no way around it, because you need two pieces:
a file that is going to be embedded
a picture that shows the content of the file, usually a screenshot of the first page
I wrote a blog entry about that: Embedd pdf into powerpoint by usage of openxml. This is not exactly your requirement, but it works identically.
There are two issues with PDFs when it comes to embedding:
Embedded PDF documents have different content than the original PDF file. For all other OLE formats I know (Excel, Word, PowerPoint, ...) this is not the case: you can just use the file from the hard disk; for PDF you cannot.
You need to take a picture of the first page. You could use pdfium or the like; there are quite a few tools out there for rendering PDF, but Adobe Reader is free and does the job 100%.