I am using iTextSharp for PDF processing, and I need to extract all text from an existing PDF that is written in a certain font.
A way to do that is to inherit from a RenderFilter and only allow text that has a certain PostscriptFontName. The problem is that when I do this, I see the following font names in the PDF:
CIDFont+F1
CIDFont+F2
CIDFont+F3
CIDFont+F4
CIDFont+F5
which is nothing like the actual font names I am looking for.
I have tried enumerating the font resources, and it shows the same result.
I have tried opening the PDF in the full Adobe Acrobat. It also shows the mangled font names.
I have tried analysing the file with iText RUPS. Same result.
That is, I have not been able to see the actual font names anywhere in the document structure.
Yet, Adobe Acrobat DC does show the correct font names in the Format pane when I select various text boxes on the document canvas (e.g. Arial, Courier New, Roboto), so that information must be stored somewhere.
How do I get those real font names when parsing PDFs with iTextSharp?
As determined in the course of the comments to the question, the font names are anonymized in all PDF metadata for the font but the embedded font program itself contains the actual font name.
(So the PDF strictly speaking is broken, even though in a way hardly any software will ever complain about.)
If we want to retrieve those names, therefore, we have to look inside these font programs.
Here is a proof of concept following the architecture used in the answer you referenced, i.e. using a RenderFilter:
class FontProgramRenderFilter : RenderFilter
{
    public override bool AllowText(TextRenderInfo renderInfo)
    {
        DocumentFont font = renderInfo.GetFont();
        PdfDictionary fontDict = font.FontDictionary;
        PdfName subType = fontDict.GetAsName(PdfName.SUBTYPE);
        if (PdfName.TYPE0.Equals(subType))
        {
            PdfArray descendantFonts = fontDict.GetAsArray(PdfName.DESCENDANTFONTS);
            PdfDictionary descendantFont = descendantFonts[0] as PdfDictionary;
            PdfDictionary fontDescriptor = descendantFont.GetAsDict(PdfName.FONTDESCRIPTOR);
            PdfStream fontStream = fontDescriptor.GetAsStream(PdfName.FONTFILE2);
            byte[] fontData = PdfReader.GetStreamBytes((PRStream)fontStream);
            MemoryStream dataStream = new MemoryStream(fontData);
            dataStream.Position = 0;
            MemoryPackage memoryPackage = new MemoryPackage();
            Uri uri = memoryPackage.CreatePart(dataStream);
            GlyphTypeface glyphTypeface = new GlyphTypeface(uri);
            memoryPackage.DeletePart(uri);
            ICollection<string> names = glyphTypeface.FamilyNames.Values;
            return names.Where(name => name.Contains("Arial")).Count() > 0;
        }
        else
        {
            // analogous code for other font subtypes
            return false;
        }
    }
}
The MemoryPackage class is from this answer which was my first find searching for how to read information from a font in memory using .Net.
Applied to your PDF file like this:
using (PdfReader pdfReader = new PdfReader(SOURCE))
{
    FontProgramRenderFilter fontFilter = new FontProgramRenderFilter();
    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), fontFilter);
    Console.WriteLine(PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy));
}
the result is
This is Arial.
Beware: This is a mere proof of concept.
On one hand you will surely also need to implement the part commented as analogous code for other font subtypes above; and even the TYPE0 part is not ready for production use as it only considers FONTFILE2 and does not handle null values gracefully.
On the other hand you will want to cache names for fonts already inspected.
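To make the idea of looking inside the font program concrete without WPF's GlyphTypeface, here is a minimal sketch in plain Java (no iText) that reads the naming table of a TrueType/OpenType font from a byte array, such as the bytes of a FontFile2 stream. The class and method names (FontNameTable, readNames) are made up for illustration. The sketch assumes a well-formed single-font file, does no bounds checking, and decodes Macintosh-platform strings as ISO-8859-1, which is only an approximation of MacRoman.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

public class FontNameTable {
    /**
     * Returns all strings stored under the given name ID in the font's
     * "name" table (1 = family, 4 = full name, 6 = PostScript name).
     * Hypothetical helper, not part of iText or iTextSharp.
     */
    public static Set<String> readNames(byte[] font, int nameId) {
        ByteBuffer buf = ByteBuffer.wrap(font); // big-endian by default, like the font format
        Set<String> names = new LinkedHashSet<>();
        int numTables = buf.getShort(4) & 0xFFFF;
        for (int i = 0; i < numTables; i++) {
            int rec = 12 + 16 * i; // table directory entries follow the 12-byte header
            String tag = new String(font, rec, 4, StandardCharsets.US_ASCII);
            if (!tag.equals("name")) {
                continue;
            }
            int table = buf.getInt(rec + 8); // offset of the name table in the file
            int count = buf.getShort(table + 2) & 0xFFFF;
            int strings = table + (buf.getShort(table + 4) & 0xFFFF);
            for (int j = 0; j < count; j++) {
                int nr = table + 6 + 12 * j; // 12-byte name records follow the 6-byte header
                int platform = buf.getShort(nr) & 0xFFFF;
                if ((buf.getShort(nr + 6) & 0xFFFF) != nameId) {
                    continue;
                }
                int length = buf.getShort(nr + 8) & 0xFFFF;
                int offset = buf.getShort(nr + 10) & 0xFFFF;
                // Platform 1 (Macintosh) uses a single-byte encoding; the Unicode
                // and Windows platforms store UTF-16BE.
                Charset cs = platform == 1 ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_16BE;
                names.add(new String(font, strings + offset, length, cs));
            }
        }
        return names;
    }
}
```

Calling readNames(fontData, 1) yields the family names, which can then be compared against the font being looked for; since the result depends only on the font stream, it is a natural value to cache per font.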
I've been attempting to find an easy solution to exporting a Canvas in my WPF Application to a PDF Document.
So far, the best solution has been to use the PrintDialog and set it up to automatically use the 'Microsoft Print to PDF' printer. The only problem I have had with this is that although the PrintDialog is skipped, there is a FileDialog to choose where the file should be saved.
Sadly, this is a deal-breaker because I would like to run this over a large number of canvases with automatically generated PDF names (well, programmatically provided anyway).
Other solutions I have looked at include:
Using PrintDocument, but from my experimentation I would have to manually iterate through all my Canvas's children and manually invoke the correct Draw method (which would be rather time-consuming for my many custom elements with transformations).
Exporting as a PNG image and then embedding that in a PDF. Although this works, TextBlocks within my canvas are no longer text, so this isn't ideal.
Using the third-party library PDFSharp, which has the same downside as PrintDocument: a lot of custom logic for each element.
With PDFSharp, I did find a method for generating the XGraphics from a Canvas, but no way of then consuming that object to make a PDF page.
So does anybody know how I can skip or automate the PDF PrintDialog, or consume PDFSharp XGraphics to make a page? Or are there any other ideas for directions to take this, besides writing a whole library to convert each of my Canvas elements to PDF elements?
If you look at the output port of a recent Windows installation of Microsoft Print to PDF, you may note it is set to PORTPROMPT: and that is exactly what causes the request for a filename.
You might note lower down, I have several ports set to a filename, and the fourth one down is called "My Print to PDF"
So very last century methodology; when I print with a duplicate printer but give it a different name I can use different page ratios etc., without altering the built in standard one. The output for a file will naturally be built:-
A) Exactly in one repeatable location, that I can file monitor and rename it, based on the source calling the print sequence, such that if it is my current default printer I can right click files to print to a known \folder\file.pdf
B) The same port can be used via certain /pt (printto) command combinations to output, not just to that default port location, but to a given folder\name such as
"%ProgramFiles%\Windows NT\Accessories\WORDPAD.EXE" /pt listIN.doc "My Print to PDF" "My Print to PDF" "listOUT.pdf"
Other drivers usually charge for the convenience of WPF programmable renaming, but I will leave you that PrintVisual challenge for another of your three wishes.
MS suggest XPS is best, but then they would be promoting it as a PDF competitor.
It does not need to be Doc[X]2PDF it could be [O]XPS2PDF or aPNG2PDF or many pages TIFF2PDF etc. etc. Any of those are Native to Win 10 also other 3rd party apps such as [Free]Office with a PrintTo verb will do XLS[X]2PDF. Imagination becomes pagination.
I had a great success in generating PDFs using PDFSharp in combination with SkiaSharp (for more advanced graphics).
Let me begin from the very end:
you save the PdfDocument object in the following way:
PdfDocument yourDocument = ...;
string filename = @"your\file\path\document.pdf";
yourDocument.Save(filename);
creating the PdfDocument with a page can be achieved the following way (adjust the parameters to fit your needs):
PdfDocument yourDocument = new PdfDocument();
yourDocument.PageLayout = PdfPageLayout.SinglePage;
yourDocument.Info.Title = "Your document title";
PdfPage yourPage = yourDocument.AddPage();
// Orientation and Size are properties of the page, not of the document
yourPage.Orientation = PageOrientation.Landscape;
yourPage.Size = PageSize.A4;
the PdfPage object's content (as an example I'm putting a string and an image) is filled in the following way:
using (XGraphics gfx = XGraphics.FromPdfPage(yourPage))
{
    XFont yourFont = new XFont("Helvetica", 20, XFontStyle.Bold);
    gfx.DrawString(
        "Your string in the page",
        yourFont,
        XBrushes.Black,
        new XRect(0, XUnit.FromMillimeter(10), yourPage.Width, yourFont.GetHeight()),
        XStringFormats.Center);
    using (Stream s = new FileStream(@"path\to\your\image.png", FileMode.Open))
    {
        XImage image = XImage.FromStream(s);
        // 42 mm wide, height scaled to keep the image's aspect ratio
        var imageRect = new XRect()
        {
            Location = new XPoint() { X = XUnit.FromMillimeter(42), Y = XUnit.FromMillimeter(42) },
            Size = new XSize() { Width = XUnit.FromMillimeter(42), Height = XUnit.FromMillimeter(42.0 * image.PixelHeight / image.PixelWidth) }
        };
        gfx.DrawImage(image, imageRect);
    }
}
Of course, the font objects can be created as static members of your class.
And this is, in short to answer your question, how you consume the XGraphics object to create a PDF page.
Let me know if you need more assistance.
I need to read the text of a PDF and do some processing on the extracted text. I'm using iTextSharp to read the PDF. The problem is that PdfTextExtractor.GetTextFromPage doesn't give me all the contents of the page.
In my PDF I am unable to read the text that is highlighted in blue; the rest of the characters I am able to read. Below is the code that does the extraction:
string filePath = "myFile path";
PdfReader pdfReader = new PdfReader(filePath);
for (int page = 1; page <= 1; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
}
Any suggestions here?
I have gone through lots of questions and solutions on SO, but none is specific to this problem.
The reason for text extraction not extracting those texts is pretty simple: Those texts are not part of the static page content but form fields! But "Text extraction" in iText (and other PDF libraries I know, too) is considered to mean "extraction of the text of the static page content". Thus, those texts you miss simply are not subject to text extraction.
If you want to make form field values subject to your text extraction code, too, you first have to flatten the form field visualizations. "Flattening" here means making them part of the static page content and dropping all their form field dynamics.
You can do that by adding after reading the PDF in this line
PdfReader pdfReader = new PdfReader(filePath);
code to flatten this PDF and loading the flattened PDF into the pdfReader, e.g. like this:
MemoryStream memoryStream = new MemoryStream();
PdfStamper pdfStamper = new PdfStamper(pdfReader, memoryStream);
pdfStamper.FormFlattening = true;
pdfStamper.Writer.CloseStream = false;
pdfStamper.Close();
memoryStream.Position = 0;
pdfReader = new PdfReader(memoryStream);
Extracting the text from this re-initialized pdfReader will give you the text from the form fields, too.
Unfortunately, the flattened form text is added at the end of the content stream. As your chosen text extraction strategy SimpleTextExtractionStrategy simply returns the text in the order it is drawn, the former form fields contents all are extracted at the end.
You can change this by using a different text extraction strategy, i.e. by replacing this line:
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
Using the LocationTextExtractionStrategy (which is part of the iText distribution) already returns a better result; unfortunately the form field values are not exactly on the same base line as the static contents we perceive to be on the same line, so there are some unexpected line breaks.
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
Using the HorizontalTextExtractionStrategy (from this answer which contains both a Java and a C# version thereof) the result is even better. Beware, though, this strategy is not universally better, read the warnings in the answer text.
ITextExtractionStrategy strategy = new HorizontalTextExtractionStrategy();
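Why slightly offset baselines lead to unexpected line breaks can be shown with a small stand-alone sketch in plain Java. This is not iText's algorithm and uses no iText classes; Chunk, toLines and the tolerance parameter are hypothetical names introduced only for illustration. Text chunks are grouped into lines by their baseline y coordinate: with a tolerance of 0 a form value sitting 1.5 units below its label ends up on its own line, while a small tolerance merges the two.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BaselineGrouping {
    /** A piece of text and the start of its baseline (hypothetical type, not an iText class). */
    public static final class Chunk {
        final String text;
        final double x;
        final double y;

        public Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins chunks into lines; a chunk joins the current line if its baseline is within tolerance. */
    public static List<String> toLines(List<Chunk> chunks, double tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upwards, so larger y means nearer the top of the page.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y));
        List<List<Chunk>> lines = new ArrayList<>();
        double lineY = 0;
        for (Chunk c : sorted) {
            if (lines.isEmpty() || Math.abs(c.y - lineY) > tolerance) {
                lines.add(new ArrayList<>()); // baseline too far away: start a new line
                lineY = c.y;
            }
            lines.get(lines.size() - 1).add(c);
        }
        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x)); // left to right
            StringBuilder sb = new StringBuilder();
            for (Chunk c : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(c.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

A tolerance that is too generous would merge genuinely separate lines, which is one reason no single extraction strategy is universally better.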
I currently have a problem with PdfSharp/MigraDoc and a PDF viewer. I have used the EZFontResolver made by Thomas to be able to generate PDFs with custom fonts. Unfortunately the PDF viewer is unable to render the font, and I have no idea why. I have seen a bug described by Travis on Thomas' blog, which noted that if EZFontResolver doesn't recognize multiple bold/italic suffixes (for example "fontname|b|b"), then PdfDocumentRenderer.RenderDocument() fails. The point is, when I try something like this:
Document document = DdlReader.DocumentFromString(ddl);
_renderer = new DocumentRenderer(document);
_renderer.PrepareDocument();
then the EZFontResolver is asked for fonts with names like "customfont|b|b" instead of "customfont" (this doesn't happen when I use only PdfDocument.Save(...)).
My pdf viewer overrides DocumentViewer and views FixedDocument class instances. The funny thing is that the saved pdf file has all the fonts set, but the preview is unable to do that (and that is my big problem). All of this happens even though I return the right font with the resolver.
EDIT:
The ddl is a string which looks something like this:
"\\document
[
Info
{
Title = \"My file\"
Subject = \"My pdf file\"
Author = \"mikes\"
}
]
{
\\styles
{
Heading1 : Normal
{
Font
{
Name = \"My custom font\"
Bold = true
}
ParagraphFormat
{
Alignment = Center
SpaceBefore = \"0.5cm\"
SpaceAfter = \"0.5cm\"
}
}
header : Normal
{
Font
{
Name = \"My custom font\"
Size = 6
}
ParagraphFormat
{
Alignment = Center
}
}
And when I deleted the bug fix by Travis, the exception was thrown in the _renderer.PrepareDocument() (after fix, the stack trace showed that the source of multiple "|b" was also out of there).
Simulated bold and simulated italics use the regular font, but a transformation is applied.
Therefore the simulation will not work if the PDF viewer does not support those transformations.
The DocumentViewer that comes with MigraDoc does not display PDF files, it displays MigraDoc documents. For technical reasons it cannot use fonts supplied via the IFontResolver interface. EZFontResolver is an implementation of IFontResolver.
With respect to "customfont|b|b": I cannot say whether this is a bug or the regular behaviour. Please provide an MCVE (complete sample) if you think it is a bug.
I have a ready made PDF, and I would need to modify the trimbox, bleedbox with SetBoxSize and use the setPDFXConformance. Is there a way to do this?
I've tried with stamper.Writer, but it doesn't care about what I set there
2011.02.01.
We've tested it with Acrobat Pro, and it said that the trimbox was not defined. It seems that the stamper's writer's methods/properties don't affect the resulting PDF. Here are the source and result files: http://stemaweb.hu/pdfs.zip
my code:
PdfReader reader = new PdfReader(@"c:\source.pdf");
PdfStamper stamper = new PdfStamper(reader, new FileStream(@"c:\result.pdf", FileMode.Create));
stamper.Writer.SetPageSize(PageSize.A4);
stamper.Writer.PDFXConformance = PdfWriter.PDFX32002;
stamper.Writer.SetBoxSize("trim", new iTextSharp.text.Rectangle(20, 20, 100, 100));
PdfContentByte cb = stamper.GetOverContent(1);
/*drawing*/
stamper.Close();
Because the boxes are not visible, I tried to modify the pagesize with the writer but that didn't do anything either.
SetPDFXConformance won't turn a "normal" PDF into a PDF/X pdf. SetPDFXConformance is really just for document generation, causing iText to throw an exception if you do something blatantly off spec.
"it doesn't care about what I set there". Trim and bleed boxes are not something you can see visually in Reader. How are you testing for them?
Could you post some code, and a link to your output PDF?
Ah. You're using stamper.Writer. In this case, that doesn't work out so well. All the page level, Well Supported Actions via PdfStamper will take a page number or page's PdfDictionary as an argument. SetBoxSize just takes a string & a rectangle, so that's your clue.
Going "under the hood" as you are is actually defaulting back to PdfWriter.setBoxSize... which is only for creating PDFs, not modifying an existing page.
So: You need to use the low-level PDF objects to make the changes you want. No Problemo:
for (int i = 1; i <= myReader.getNumberOfPages(); ++i) {
    PdfDictionary pageDict = myREADER_YES_READER.getPageN(i);
    PdfRectangle newBox = new PdfRectangle( 20, 20, 100, 100 );
    pageDict.put(PdfName.TRIMBOX, newBox);
    newBox = new PdfRectangle( PageSize.A4 );
    pageDict.put(PdfName.MEDIABOX, newBox );
}
/* drawing */
stamper.close();
As to the PDFX32002 conformance, I think you're going to have to go code diving to figure out exactly what is needed. Writer.PDFXConformance is another aspect of Writer that only works when generating a PDF, not modifying an existing one.
The good news is that PdfXConformanceImp is a public class. The bad news is that it's only used internally by PdfWriter and PdfContentByte... hey. You are getting some changes in behavior with your present code (just not enough). Specifically, if you try something that isn't allowed within that PdfContentByte, you'll get a PdfXConformanceException with a message describing the restriction you've violated. Trying to add an optional content group (layer) would throw, for example.
Ah. That's not so bad. MAYBE. Try this:
PdfXConformanceImp pdfx = new PdfXConformanceImp();
pdfx.setConformance(PdfWriter.PDFX32002);
pdfx.completeInfoDictionary(stamper.Writer.getInfo());
pdfx.completeExtraCatalog(stamper.Writer.getExtraCatalog());
stamper.close();
If you drop stamper.Writer.PDFXConformance = PdfWriter.PDFX32002;, you won't get exceptions when you do something Forbidden in your contentByte. Other than that, I don't think it'll matter.
Hmm.. That's not the whole solution. The OutputIntents from the extraCatalog are merged into the main catalog as well. Perhaps this will work:
//replace the completeExtraCatalog call above with this
pdfx.completeExtraCatalog(myReader.getCatalog());
I wish you luck.
I'm not sure that this is possible but I figured it would be worth asking. I have figured out how to set the font of a formfield using the pdfstamper and acrofields methods but I would really like to be able to set the font of different parts of the text in the same field. Here's how I'm setting the font of the form fields currently:
// Use iTextSharp PDF Reader, to get the fields and send to the
//Stamper to set the fields in the document
PdfReader pdfReader = new PdfReader(fileName);
// Initialize Stamper (ms is a MemoryStream object)
PdfStamper pdfStamper = new PdfStamper(pdfReader, ms);
// Get Reference to PDF Document Fields
AcroFields pdfFormFields = pdfStamper.AcroFields;
//create a bold font
iTextSharp.text.Font bold = FontFactory.GetFont(FontFactory.COURIER, 8f, iTextSharp.text.Font.BOLD);
//set the field to bold
pdfFormFields.SetFieldProperty(nameOfField, "textfont", bold.BaseFont, null);
//set the text of the form field
pdfFormFields.SetField(nameOfField, "This: Will Be Displayed In The Field");
// Flattening flag: true flattens the form (fields become static content); set it to false if the document should remain editable
pdfStamper.FormFlattening = true;
// close the pdf stamper
pdfStamper.Close();
What I'd like to be able to do where I set the text above is set the "This: " to bold and leave the "Will Be Displayed In The Field" non-bolded. I'm not sure this is actually possible but I figured it was worth asking because it would really be helpful in what I'm currently working on.
Thanks in advance!
Yes, kinda. PDF fields can have a rich text value (since acrobat 6/pdf1.5) along with a regular value.
The regular value uses the font defined in the default appearances... a single font.
The rich value format (which iText doesn't support directly, at least not yet) is described in chapter 12.7.3.4 of the PDF Reference: it supports <b>, <i>, <p>, and quite a few CSS2 text attributes. It requires a <body> element with various attributes.
To enable rich values, you have to set bit 26 of the field flags (PdfName.FF) for a text field. PdfFormField doesn't have a "setRichValue", but they're dictionaries, so you can just:
myPdfFormField.put(PdfName.RV, new PdfString( richTextValue ) );
If you're trying to add rich text to an existing field that doesn't already support it:
AcroFields fields = stamper.getAcroFields();
AcroFields.Item fldItem = fields.getFieldItem(fldName);
PdfDictionary mergedDict = fldItem.getMerged(0);
int flagVal = mergedDict.getAsNumber(PdfName.FF).intValue();
// flag bit positions are counted from 1, so bit 26 is 1 << 25 (the value of PdfFormField.FF_RICHTEXT)
flagVal |= (1 << 25);
int writeFlags = AcroFields.Item.WRITE_MERGED | AcroFields.Item.WRITE_VALUE;
fldItem.writeToAll(PdfName.FF, new PdfNumber(flagVal), writeFlags);
fldItem.writeToAll(PdfName.RV, new PdfString(richTextValue), writeFlags);
I'm actually adding rich text support to iText (not sharp) as I type this message. Hurray for contributors on SO. Paulo's been good about keeping iTextSharp in synch lately, so that shouldn't be an issue. The next trunk release should have this feature... so you'd be able to write:
myPdfFormField.setFieldFlags( PdfFormField.FF_RICHTEXT );
myPdfFormField.setRichValue( richTextValue );
or
// note that this will fail unless the rich flag is set
acroFields.setFieldRichValue( richTextValue );
NOTE: iText's appearance generation hasn't been updated, just the value side of things. That would take considerably more work. So you'll want to call acroFields.setGenerateAppearances(false) or have JS that resets the field value when the form is opened, to force Acrobat/Reader to build the appearance[s] for you.
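As a small stand-alone illustration in plain Java (no iText), the sketch below shows how a flag "bit position" from the PDF specification maps to the integer stored in /Ff, and what a rich text value for the question's example might look like. The class and method names are hypothetical, and the attributes on the <body> element are an assumption based on what Acrobat typically writes; check them against the rich text section of the specification before relying on them.

```java
public class RichTextFieldSketch {
    /** Converts a 1-based flag bit position into the integer mask to OR into /Ff. */
    public static int flagMask(int bitPosition) {
        return 1 << (bitPosition - 1);
    }

    /** Wraps XHTML-style markup in a <body> element (attribute values are assumptions). */
    public static String richValue(String innerXhtml) {
        return "<?xml version=\"1.0\"?>"
             + "<body xmlns=\"http://www.w3.org/1999/xhtml\""
             + " xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\""
             + " xfa:APIVersion=\"Acrobat:6.0.0\" xfa:spec=\"2.0.2\">"
             + innerXhtml
             + "</body>";
    }

    public static void main(String[] args) {
        int ff = 0;
        ff |= flagMask(26);     // RichText is flag bit 26
        System.out.println(ff); // 33554432
        System.out.println(richValue("<p><b>This: </b>Will Be Displayed In The Field</p>"));
    }
}
```

The resulting string is what would be stored under PdfName.RV; the plain value under /V should still carry the unformatted text.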
It took me some time to figure this out after the rich text field did not work the way it was supposed to with AcroFields; in most cases the PDF was not editable with different fonts at runtime.
I worked out a way of setting different fonts in AcroFields, passing values, and editing at runtime with iTextSharp, and I thought it would be useful for others.
Create a PDF (PDF1.pdf) with a text field (I hope you know how to create a field in a PDF), e.g. txtComments.
Go to the properties section and set the field to rich text and multiline.
Format the text content in Word or in the PDF by adding the fonts and colors.
If done in Word, copy and paste the content into the txtComments field of the PDF.
Note:
If you want to add dynamic content, put the placeholder "{0}" into the txtComments field in the PDF.
Using the string.Format method you can then set values for it, as shown in the code below.
e.g., "Set different parts of a form field to have different fonts using {0}"
Add the following code to a button event handler (any handler will do), referencing itextsharp.dll 5.4.2:
Response.ContentType = "application/pdf";
Response.AddHeader("Content-disposition", "attachment; filename=your.pdf");
PdfReader reader = new PdfReader(@"C:\test\pdf1.pdf");
PdfStamper stamp = new PdfStamper(reader, Response.OutputStream);
AcroFields field = stamp.AcroFields;
// read the rich text template (containing "{0}") from the field
string comments = field.GetFieldRichValue("txtComments");
string Name = "Test1";
string value = string.Format(comments, Name);
field.SetField("txtComments", value);
field.GenerateAppearances = false; // the PDF viewer knows what to do
stamp.FormFlattening = false; // keep the field editable at run time
stamp.FreeTextFlattening = true;
stamp.Close();
reader.Close();
Hope this helps.