Fill pdf template fields with cyrillic values (itextsharp)

I have a PDF template file with fields.
The template was created by a customer. It contains some text, field labels, and the fields themselves. The text and labels use a font that is embedded in the template.
Problems occur when I try to fill the fields with Cyrillic values: no Cyrillic characters appear in the resulting document.
I have seen many similar problems that were solved by using a substitution font for AcroFields. But here I can't use one specific font for substitution, because I can't define the field font in the template.
I tried setting different fonts for the fields in Acrobat Editor (Times New Roman, Arial, and other well-known Windows fonts), but it has no effect on the resulting PDF.
Code sample:
FontFactory.RegisterDirectory(Environment.GetFolderPath(Environment.SpecialFolder.Fonts));
using (var dest = File.Create(@"result.pdf"))
{
    using (var stamper = new PdfStamper(reader, dest))
    {
        var fields = stamper.AcroFields;
        fields.SetField("ClientName", "Имя клиента");
        stamper.FormFlattening = true;
        stamper.Close();
    }
}
I even registered all available fonts in FontFactory, but it had no effect.
So the questions are:
1. If it is possible to embed a font in Adobe Acrobat that is used for fields only, how do I do it?
2. If it is possible to define the font family for an existing field with iTextSharp, how do I do it?

Well, I wrote a solution that works for me.
1. Register all existing system fonts in FontFactory.
2. Read the document metadata to extract all fonts used in the document.
3. Read the field metadata and try to create a BaseFont matching each field's font. If there is no suitable font, use a fallback font (Arial with encoding IDENTITY_H).
The full code looks like this:
static IEnumerable<PdfFontInfo> ReadDocumentFonts(PdfReader reader)
{
    if (reader.AcroForm == null)
        yield break;
    var dr = reader.AcroForm.GetAsDict(PdfName.DR);
    if (dr == null)
        yield break;
    // Read font information from the form's default resources
    var fontDict = dr.GetAsDict(PdfName.FONT);
    if (fontDict == null)
        yield break;
    foreach (var fontKey in fontDict.Keys)
    {
        var data = fontDict.GetAsDict(fontKey);
        if (data == null)
            continue;
        // Read the font descriptor if there is one
        var descriptor = data.GetAsDict(PdfName.FONTDESCRIPTOR);
        if (descriptor != null)
        {
            // Read the font family; skip fonts that do not declare one
            var family = descriptor.GetAsString(PdfName.FONTFAMILY);
            if (family != null)
                yield return new PdfFontInfo(fontKey, family.ToUnicodeString());
        }
    }
}
static IReadOnlyList<BaseFont> CreateSubstitutionFontsForFields(PdfReader reader)
{
    if (reader.AcroForm == null || reader.AcroForm.Fields == null)
        return new List<BaseFont>(0);
    var documentFontMap = ReadDocumentFonts(reader).ToDictionary(f => f.Name, StringComparer.InvariantCultureIgnoreCase);
    var substFonts = new Dictionary<string, BaseFont>();
    var fallbackRequired = false;
    // Read the font information of each field from its default appearance (/DA)
    foreach (var field in reader.AcroForm.Fields)
    {
        var fieldFontDa = field.Info.GetAsString(PdfName.DA);
        if (fieldFontDa == null)
            continue;
        var parts = AcroFields.SplitDAelements(fieldFontDa.ToUnicodeString());
        if (parts.Length == 0)
            continue;
        var fontName = (string) parts[0];
        PdfFontInfo inf;
        if (documentFontMap.TryGetValue(fontName, out inf))
        {
            if (!substFonts.ContainsKey(fontName))
            {
                var font = FontFactory.GetFont(fontName, BaseFont.IDENTITY_H, true).BaseFont;
                substFonts.Add(fontName, font);
            }
        }
        else
            fallbackRequired = true;
    }
    var allFonts = new List<BaseFont>(substFonts.Values);
    if (fallbackRequired)
        allFonts.Add(FALLBACK_FONT);
    return allFonts;
}
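For context, the /DA (default appearance) entry that the code above reads is a short fragment of content-stream syntax such as `/Helv 12 Tf 0 g`, where the operands of `Tf` are the font resource name and the size. The following is a minimal sketch in plain C#, without iTextSharp, of pulling those two values out of such a string. `DaParser` and `TryParse` are hypothetical names invented for this illustration, and the sketch only handles the simple whitespace-separated `/Name size Tf` form.

```csharp
using System;
using System.Globalization;

public static class DaParser
{
    // Extracts the font resource name and size from a default-appearance
    // string such as "/Helv 12 Tf 0 g". Returns false when no "Tf"
    // operator preceded by a name and a number is found.
    public static bool TryParse(string da, out string fontName, out float size)
    {
        fontName = null;
        size = 0f;
        if (string.IsNullOrWhiteSpace(da))
            return false;

        var tokens = da.Split(new[] { ' ', '\t', '\r', '\n' },
                              StringSplitOptions.RemoveEmptyEntries);
        for (int i = 2; i < tokens.Length; i++)
        {
            if (tokens[i] != "Tf")
                continue;
            // Operands precede the operator: "/Name size Tf".
            if (!tokens[i - 2].StartsWith("/", StringComparison.Ordinal))
                return false;
            if (!float.TryParse(tokens[i - 1], NumberStyles.Float,
                                CultureInfo.InvariantCulture, out size))
                return false;
            fontName = tokens[i - 2].Substring(1); // drop the leading slash
            return true;
        }
        return false;
    }
}
```

The name obtained this way is a key into the form's font resources, not necessarily a font family name, which is why the answer looks it up in the document font map first.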
If you can find any errors, you are welcome to comment.

Related

Text extraction using itext7: garbage characters for some pdf documents

I have a problem extracting text from pdf documents using iText7. For documents coming from a specific source textRenderInfo.GetText() returns only garbage chars (0xfdff) in the event handler of my extraction strategy:
internal class CustomExtractionStrategy : ITextExtractionStrategy
{
    public virtual void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals((object)EventType.RENDER_TEXT))
        {
            return;
        }
        var textRenderInfo = (TextRenderInfo)data;
        bool currentResultEmpty = _result.Length == 0;
        bool isInNewLine = false;
        var baseline = textRenderInfo.GetBaseline();
        var startPoint = baseline.GetStartPoint();
        var endPoint = baseline.GetEndPoint();
        var currentText = textRenderInfo.GetText(); // returns garbage for specific pdfs
        // further processing below
        ...
    }
}
I'm not very familiar with the way text/glyph encoding works in PDF, but I'll try to give some details from comparing the problematic PDFs with an example where extraction works. For the PDFs with issues:
textRenderInfo.gs.font is MS-UIGothic
textRenderInfo.gs.font.fontProgram.codeToGlyph contains only mapping (key: 0 to a Glyph with width 1000, unicode -1, code 0)
textRenderInfo.gs.font.fontProgram.unicodeToGlyph contains no records
These are the most obvious discrepancies. If there's anything else I should look out for, please let me know. I would have provided an example of the PDF in question, but it might contain sensitive information that I must not disclose.
Note: the PDFs can be correctly read in Acrobat Reader and I can copy text from the reader into notepad. Other libraries (pdfium based or ports of PDFBox) can properly extract text from the document. So I think the document as such is "valid".
If this is a known issue for iText7, is there any workaround (other than using a different library altogether)?
Update
With the link provided in the comment and the following code (in addition to the custom extraction strategy snippet shown above), I still get garbage characters:
internal class PdfExtractor
{
    internal void ExtractFromPath(string path)
    {
        PdfReader reader = new PdfReader(path);
        var document = new iText.Kernel.Pdf.PdfDocument(reader);
        for (int pageNum = 1; pageNum <= document.GetNumberOfPages(); pageNum++)
        {
            var page = document.GetPage(pageNum);
            string text = PdfTextExtractor.GetTextFromPage(page, new CustomExtractionStrategy());
        }
    }
}

change existing font in pdf file itextsharp c#

I want to change the style of an existing font from regular to bold and change the font size. For example, if the font style is regular, I want to change it to bold; if the font size is 10, I want to increase or decrease it by one (10->11 or 10->9).
After searching on this topic I found the code below, but it only reads the font information and doesn't change the style or size.
string OutputFile = "font.pdf";
//PdfReader pdfReader = new PdfReader(strFile);
PdfReader pdfReader = new PdfReader(mStream.ToArray());
// Get the first page. Font information is generally found there; to cover all pages, loop through them, e.g. for (int i = 1; i <= pdfReader.NumberOfPages; i++)
PdfDictionary cpage = pdfReader.GetPageN(1);
if (cpage == null)
    return;
PdfDictionary dictFonts = cpage.GetAsDict(PdfName.RESOURCES).GetAsDict(PdfName.FONT);
if (dictFonts != null)
{
    foreach (var font in dictFonts)
    {
        var dictFontInfo = dictFonts.GetAsDict(font.Key);
        if (dictFontInfo != null)
        {
            // Get the font name (optional)
            var baseFont = dictFontInfo.Get(PdfName.BASEFONT);
            string strFontName = System.Text.Encoding.ASCII.GetString(baseFont.GetBytes(), 0, baseFont.Length);
            //var bf = BaseFont.CreateFont((PRIndirectReference)baseFont);
            //iTextSharp.text.Font exFont =new iTextSharp.text.Font(bf,20f);
            // Remove the current font
            //dictFontInfo.Remove(PdfName.BASEFONT);
            // Set the new font, e.g. Braille, Arial etc.
            //dictFontInfo.Put(PdfName.BASEFONT, new PdfString("Braille"));
        }
    }
}
// Now create a new document with the updated font
using (FileStream FS = new FileStream(OutputFile, FileMode.Create, FileAccess.Write, FileShare.None))
{
    using (Document Doc = new Document())
    {
        using (PdfCopy writer = new PdfCopy(Doc, FS))
        {
            Doc.Open();
            for (int j = 1; j <= pdfReader.NumberOfPages; j++)
            {
                writer.AddPage(writer.GetImportedPage(pdfReader, j));
            }
            Doc.Close();
        }
    }
}
pdfReader.Close();
I also want to replace some fonts, such as Arial, with a different font.

Changing the font of an existing PDF can not be done in a meaningful, generic way, without risk of messing up the layout.
To illustrate, assume you have the following text.
I'm using | to indicate the page boundary.
Lorem Ipsum Dolor   |
Sit Amet Consectetur|
Nunc                |
If I make this text larger, or make it bold, or even italic, it is likely to take up more space. That means the word 'Consectetur' will no longer fit on the line.
PDF (unlike a Word document) does not automatically re-flow its content. The content would simply appear to go over the page boundary (and depending on the viewer you are using it might vanish).
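To put rough numbers on this, here is a small sketch. It assumes every glyph advances by the same fraction of the font size (0.5 em in the usage note, a made-up figure; real fonts define a width per glyph) and a line that is 100 pt wide. `LineFit`, `ApproxWidth` and `Fits` are hypothetical names used only for this illustration.

```csharp
using System;

public static class LineFit
{
    // Approximate width of a string in points, assuming each glyph
    // advances by avgAdvanceEm times the font size. This is a
    // simplification; real fonts have per-glyph widths and kerning.
    public static double ApproxWidth(string text, double fontSize, double avgAdvanceEm)
    {
        return text.Length * fontSize * avgAdvanceEm;
    }

    // True when the approximated text width does not exceed the line width.
    public static bool Fits(string text, double fontSize, double avgAdvanceEm, double lineWidth)
    {
        return ApproxWidth(text, fontSize, avgAdvanceEm) <= lineWidth;
    }
}
```

Under these assumptions, the 20-character line "Sit Amet Consectetur" is 100 pt wide at size 10 and just fits, while at size 11 it is 110 pt wide and runs past the boundary. Nothing in the PDF moves the last word to the next line.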
The real problem is that the PDF format does not carry the same information as the Word format:
where are word-boundaries located?
where are paragraph boundaries?
(what language is this text being written in?)
All of these are important when performing layout for a document. And none of these are naturally present in a PDF document.

PDFsharp: Replace a string using PDFsharp

This question has already been asked, but the existing answer uses iTextPDF rather than PDFsharp.
Coming back to the question: I know a way to read and extract the string, but I'm having trouble replacing the text.
My code:
    var content = ContentReader.ReadContent(page);
    var text = content.ExtractText();
    text = text.Replace("Replace This", "With This");
    XFont font = new XFont("Times New Roman", 11, XFontStyle.BoldItalic);
    gfx.DrawString(text, font, XBrushes.Black, new XRect(0, 0, page.Width, page.Height), XStringFormats.Left);
    // Save the document...
    const string filename = "New Doc.pdf";
    document.Save(filename);
}
public static IEnumerable<string> ExtractText(this CObject cObject)
{
    if (cObject is COperator)
    {
        var cOperator = cObject as COperator;
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
                foreach (var txt in ExtractText(cOperand))
                    yield return txt;
        }
    }
    else if (cObject is CSequence)
    {
        var cSequence = cObject as CSequence;
        foreach (var element in cSequence)
            foreach (var txt in ExtractText(element))
                yield return txt;
    }
    else if (cObject is CString)
    {
        var cString = cObject as CString;
        yield return cString.Value;
    }
}
This is sample code; it ignores the graphics and images and ends up writing only text to the output file. Is there a way to replace the text without touching the graphics and images in the content?

The sample seems to be a wrong approach: it returns text only, but ignores graphics, images, and even text positions and text attributes.
You can try to locate the text instructions (TJ, Tj) in the content and replace them with new instructions (also TJ or Tj) without touching anything else in the stream. Such a simple approach would lead to overlapping text or large gaps if the new text has a different length.
PDFsharp was not designed to parse the content streams. You have to write your own code to extract text, you have to write your own code to modify text (or use a third-party library that was built on PDFsharp).
To answer your question: yes, there is a way (as outlined above), but you will have to write a whole lot of code to achieve this (or find suitable code written by a third party).
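As a toy illustration of replacing a Tj instruction while leaving everything else in the stream untouched, the sketch below works on a content stream that has already been decoded to a string. It does not use PDFsharp at all; `TjReplacer` is a hypothetical name, and the sketch assumes simple literal strings in a single-byte encoding. Escaped parentheses, hex strings, TJ arrays, and text split across several instructions are deliberately not handled, and `newText` must not itself contain parentheses or backslashes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TjReplacer
{
    // Matches a simple literal string operand followed by the Tj operator,
    // for example "(Replace This) Tj".
    private static readonly Regex TjPattern =
        new Regex(@"\((?<text>[^()\\]*)\)\s*Tj\b", RegexOptions.Compiled);

    // Rewrites only those Tj instructions whose operand equals oldText;
    // every other byte of the content stream is returned unchanged.
    public static string Replace(string contentStream, string oldText, string newText)
    {
        return TjPattern.Replace(contentStream, m =>
            m.Groups["text"].Value == oldText
                ? "(" + newText + ") Tj"
                : m.Value);
    }
}
```

For the input `BT /F1 11 Tf 72 700 Td (Replace This) Tj ET q /Im0 Do Q`, only the operand of Tj changes and the image instruction `/Im0 Do` is left as it was. The layout caveat from the answer still applies: nothing here repositions the text that follows.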

Does itextsharp supports courier new font or any other registered font in c#

I have a PDF template ready, onto which I write values matching the keys in the PDF. I need to set different fonts and font sizes based on requirements, using iTextSharp (PdfStamper).
But I want to use various different fonts, such as Courier New, Arial, and some other third-party registered fonts. How do I use those different fonts with iTextSharp?
Please refer to the following code snippet, which is used to write values onto the PDF template.
To set the font, call the function SetPrintFont, where the different fonts and font sizes are defined.
Please advise.
    var pdfReader = new PdfReader(filebyte);
    var pdfStamper = new PdfStamper(pdfReader, new FileStream(pdfname, FileMode.Create));
    AcroFields pdfFormFields = pdfStamper.AcroFields;
    foreach (DictionaryEntry de in pdfReader.AcroFields.Fields)
    {
        // set the field font
        pdfFormFields.SetFieldProperty(de.Key.ToString(), "textfont", font.BaseFont, null);
        Regex regex = new Regex(@"^\d$");
        if (regex.IsMatch(de.Key.ToString()))
        {
            // set the text of the form field
            pdfFormFields.SetField(de.Key.ToString(), response.ResponseValues.ToString());
        }
        else
        {
            pdfFormFields.SetField(de.Key.ToString(), response.ResponseValues.ToString());
        }
    }
    pdfStamper.FormFlattening = false;
    pdfStamper.Close();
}

You can get the list of registered fonts like this:
ICollection<string> registeredFonts = iTextSharp.text.FontFactory.RegisteredFonts;
In any case, iTextSharp lets you use any font you want. If you can't find the desired font, you can download a .ttf file and load it:
BaseFont baseFont = BaseFont.CreateFont(fontFolderPath + "arial.ttf", BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
Font font = new iTextSharp.text.Font(baseFont, fontSize, iTextSharp.text.Font.BOLD);
BaseFont is a member of iTextSharp.text.pdf.

how can I put a content in a mergefield in docx

I'm developing a web application with ASP.NET, and I have a file called Template.docx that serves as a template for generating other reports. Inside this Template.docx I have some MergeFields (Title, CustomerName, Content, Footer, etc.) to be replaced with dynamic content from C#.
I would like to know: how can I put content into a MergeField in a docx file?
I don't know if MergeFields are the right way to do this or if there is another way. If you can suggest one, I'd appreciate it!
PS: I have Open XML referenced in my web application.
Edits:
private MemoryStream LoadFileIntoStream(string fileName)
{
    MemoryStream memoryStream = new MemoryStream();
    using (FileStream fileStream = File.OpenRead(fileName))
    {
        memoryStream.SetLength(fileStream.Length);
        fileStream.Read(memoryStream.GetBuffer(), 0, (int) fileStream.Length);
        memoryStream.Flush();
        fileStream.Close();
    }
    return memoryStream;
}
public MemoryStream GenerateWord()
{
    string templateDoc = "C:\\temp\\template.docx";
    string reportFileName = "C:\\temp\\result.docx";
    var reportStream = LoadFileIntoStream(templateDoc);
    // Copy a new file name from template file
    //File.Copy(templateDoc, reportFileName, true);
    // Open the new Package
    Package pkg = Package.Open(reportStream, FileMode.Open, FileAccess.ReadWrite);
    // Specify the URI of the part to be read
    Uri uri = new Uri("/word/document.xml", UriKind.Relative);
    PackagePart part = pkg.GetPart(uri);
    XmlDocument xmlMainXMLDoc = new XmlDocument();
    xmlMainXMLDoc.Load(part.GetStream(FileMode.Open, FileAccess.Read));
    // replace some keys inside xml (it will come from database, it's just a test)
    xmlMainXMLDoc.InnerXml = xmlMainXMLDoc.InnerXml.Replace("field_customer", "My Customer Name");
    xmlMainXMLDoc.InnerXml = xmlMainXMLDoc.InnerXml.Replace("field_title", "Report of Documents");
    xmlMainXMLDoc.InnerXml = xmlMainXMLDoc.InnerXml.Replace("field_content", "Content of Document");
    // Open the stream to write document
    StreamWriter partWrt = new StreamWriter(part.GetStream(FileMode.Open, FileAccess.Write));
    //doc.Save(partWrt);
    xmlMainXMLDoc.Save(partWrt);
    partWrt.Flush();
    partWrt.Close();
    reportStream.Flush();
    pkg.Close();
    return reportStream;
}
PS: When I convert MemoryStream to a file, I got a corrupted file. Thanks!
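One thing worth ruling out, independent of the package handling: Stream.Read is not guaranteed to fill the buffer in a single call, so a loader that calls it once may return a partially filled stream. The sketch below uses Stream.CopyTo instead and rewinds the result. `StreamLoader` is a hypothetical wrapper name, and whether this is the cause of the corruption described here is not established.

```csharp
using System.IO;

public static class StreamLoader
{
    // Copies the whole file into an expandable MemoryStream and rewinds it,
    // so that a consumer starts reading at the first byte.
    public static MemoryStream LoadFileIntoStream(string fileName)
    {
        var memoryStream = new MemoryStream();
        using (FileStream fileStream = File.OpenRead(fileName))
        {
            fileStream.CopyTo(memoryStream);
        }
        memoryStream.Position = 0;
        return memoryStream;
    }
}
```

When the resulting stream is later written to disk, ToArray() returns exactly the bytes written so far, whereas GetBuffer() may include unused trailing capacity.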

I know this is an old post, but I could not get the accepted answer to work for me. The project linked would not even compile (which someone has already commented in that link). Also, it seems to use other Nuget packages like WPFToolkit.
So I'm adding my answer here in case someone finds it useful. This only uses the OpenXML SDK 2.5 and also the WindowsBase v4. This works on MS Word 2010 and later.
string sourceFile = @"C:\Template.docx";
string targetFile = @"C:\Result.docx";
File.Copy(sourceFile, targetFile, true);
using (WordprocessingDocument document = WordprocessingDocument.Open(targetFile, true))
{
    // If your sourceFile is a different type (e.g., .DOTX), you will need to change the target type like so:
    document.ChangeDocumentType(WordprocessingDocumentType.Document);
    // Get the MainPart of the document
    MainDocumentPart mainPart = document.MainDocumentPart;
    var mergeFields = mainPart.RootElement.Descendants<FieldCode>();
    var mergeFieldName = "SenderFullName";
    var replacementText = "John Smith";
    ReplaceMergeFieldWithText(mergeFields, mergeFieldName, replacementText);
    // Save the document
    mainPart.Document.Save();
}
private void ReplaceMergeFieldWithText(IEnumerable<FieldCode> fields, string mergeFieldName, string replacementText)
{
    var field = fields
        .Where(f => f.InnerText.Contains(mergeFieldName))
        .FirstOrDefault();
    if (field != null)
    {
        // Get the Run that contains our FieldCode
        // Then get the parent container of this Run
        Run rFldCode = (Run)field.Parent;
        // Get the three (3) other Runs that make up our merge field
        Run rBegin = rFldCode.PreviousSibling<Run>();
        Run rSep = rFldCode.NextSibling<Run>();
        Run rText = rSep.NextSibling<Run>();
        Run rEnd = rText.NextSibling<Run>();
        // Get the Run that holds the Text element for our merge field
        // Get the Text element and replace the text content
        Text t = rText.GetFirstChild<Text>();
        t.Text = replacementText;
        // Remove all the four (4) Runs for our merge field
        rFldCode.Remove();
        rBegin.Remove();
        rSep.Remove();
        rEnd.Remove();
    }
}
What the code above does is basically this:
1. Identify the 4 Runs that make up the merge field named "SenderFullName".
2. Identify the Run that contains the Text element for our merge field.
3. Remove the 4 Runs.
4. Update the text property of the Text element for our merge field.
UPDATE
For anyone interested, here is a simple static class I used to help me with replacing merge fields.

Frank Fajardo's answer was 99% of the way there for me, but it is important to note that MERGEFIELDS can be SimpleFields or FieldCodes.
In the case of SimpleFields, the text runs displayed to the user in the document are children of the SimpleField.
In the case of FieldCodes, the text runs shown to the user are between the runs containing FieldChars with the Separate and the End FieldCharValues. Occasionally, several text containing runs exist between the Separate and End Elements.
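To make the two shapes concrete, here is a sketch that holds one hand-written WordprocessingML paragraph of each kind and inspects it with System.Xml.Linq only (no Open XML SDK). In the SDK, SimpleField corresponds to the w:fldSimple element and FieldCode to w:instrText. `MergeFieldShapes` and its members are hypothetical names, and the XML fragments are illustrative rather than taken from a real document.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class MergeFieldShapes
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Simple field: the instruction is an attribute, and the displayed
    // runs are children of w:fldSimple.
    public const string SimpleFieldXml =
        @"<w:p xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
  <w:fldSimple w:instr="" MERGEFIELD CustomerName "">
    <w:r><w:t>«CustomerName»</w:t></w:r>
  </w:fldSimple>
</w:p>";

    // Complex field: sibling runs for begin, instrText, separate,
    // one or more displayed runs, and end.
    public const string ComplexFieldXml =
        @"<w:p xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
  <w:r><w:fldChar w:fldCharType=""begin""/></w:r>
  <w:r><w:instrText> MERGEFIELD CustomerName </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType=""separate""/></w:r>
  <w:r><w:t>«Customer</w:t></w:r>
  <w:r><w:t>Name»</w:t></w:r>
  <w:r><w:fldChar w:fldCharType=""end""/></w:r>
</w:p>";

    // Number of w:fldSimple elements in the paragraph.
    public static int CountSimpleFields(string paragraphXml)
    {
        return XElement.Parse(paragraphXml).Descendants(W + "fldSimple").Count();
    }

    // Number of runs lying between the "separate" and "end" field
    // characters, i.e. the runs whose text the user actually sees.
    public static int CountDisplayedRuns(string paragraphXml)
    {
        return XElement.Parse(paragraphXml)
            .Elements(W + "r")
            .SkipWhile(r => !HasFieldChar(r, "separate"))
            .Skip(1)
            .TakeWhile(r => !HasFieldChar(r, "end"))
            .Count();
    }

    private static bool HasFieldChar(XElement run, string type)
    {
        return run.Elements(W + "fldChar")
            .Any(fc => (string)fc.Attribute(W + "fldCharType") == type);
    }
}
```

In the complex example the displayed text is split over two runs, which is the case where assuming exactly one text run after the separator breaks down.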
The code below deals with these problems. Further details of how to get all the MERGEFIELDS from the document, including the header and footer is available in a GitHub repository at https://github.com/mcshaz/SimPlanner/blob/master/SP.DTOs/Utilities/OpenXmlExtensions.cs
private static Run CreateSimpleTextRun(string text)
{
    Run returnVar = new Run();
    RunProperties runProp = new RunProperties();
    runProp.Append(new NoProof());
    returnVar.Append(runProp);
    returnVar.Append(new Text() { Text = text });
    return returnVar;
}
private static void InsertMergeFieldText(OpenXmlElement field, string replacementText)
{
    var sf = field as SimpleField;
    if (sf != null)
    {
        var textChildren = sf.Descendants<Text>();
        textChildren.First().Text = replacementText;
        foreach (var others in textChildren.Skip(1))
        {
            others.Remove();
        }
    }
    else
    {
        var runs = GetAssociatedRuns((FieldCode)field);
        var rEnd = runs[runs.Count - 1];
        foreach (var r in runs
            .SkipWhile(r => !r.ContainsCharType(FieldCharValues.Separate))
            .Skip(1)
            .TakeWhile(r => r != rEnd))
        {
            r.Remove();
        }
        rEnd.InsertBeforeSelf(CreateSimpleTextRun(replacementText));
    }
}
private static IList<Run> GetAssociatedRuns(FieldCode fieldCode)
{
    Run rFieldCode = (Run)fieldCode.Parent;
    Run rBegin = rFieldCode.PreviousSibling<Run>();
    Run rCurrent = rFieldCode.NextSibling<Run>();
    var runs = new List<Run>(new[] { rBegin, rCurrent });
    while (!rCurrent.ContainsCharType(FieldCharValues.End))
    {
        rCurrent = rCurrent.NextSibling<Run>();
        runs.Add(rCurrent);
    }
    return runs;
}
private static bool ContainsCharType(this Run run, FieldCharValues fieldCharType)
{
    var fc = run.GetFirstChild<FieldChar>();
    return fc == null
        ? false
        : fc.FieldCharType.Value == fieldCharType;
}

You could try http://www.codeproject.com/KB/office/Fill_Mergefields.aspx which uses the Open XML SDK to do this.
