c# iTextSharp converting WMF to PDF embedding fonts

I am fairly new to C#, but am using it to convert some older files in a WMF/PMF format to PDF. I have the script working, but some of the fonts in the original document are not coming through. For example, some of these old documents are payroll check runs and the checks use a special MICR font (where the account/routing numbers are displayed). This MICR line comes through the conversion as what appears to be a random base font.
I am simply using iTextSharp to convert the WMF to an image, then adding the image as a page in the PDF.
I have researched embedding fonts, but my problem is that I might not know what the original font name was. These files could be any number of things, and they could be quite old. Is there a way to include a font library that would expand iTextSharp's basic fonts so they are more likely to be recognized during the conversion?
This is the function performing the conversion. All of the WMF files (one for each check) are put in a directory and the function loads them all into a single PDF:
static bool BuildPDF(string pDecodeDir)
{
    Document pdfDoc = new Document();
    DirectoryInfo decodeDir = new DirectoryInfo(pDecodeDir);
    int vCount = 0;
    foreach (var file in decodeDir.GetFiles("*.WMF"))
    {
        try
        {
            iTextSharp.text.Image img1 = ImgWMF.GetInstance(file.FullName);
            if (vCount == 0)
            {
                // in order to inherit the document size properly, we need to load the first image before creating the PDF object
                pdfDoc = new Document(img1);
                PdfWriter.GetInstance(pdfDoc, new FileStream(targetPath, FileMode.Create));
                pdfDoc.Open();
            }
            else
            {
                pdfDoc.NewPage();
            }
            Console.WriteLine("Adding page {0}: {1}", vCount.ToString(), file.Name);
            img1.SetAbsolutePosition(0, 0);
            pdfDoc.Add(img1);
            vCount++;
        }
        catch (System.Exception docerr)
        {
            Console.WriteLine("Doc Error: {0}", docerr.Message);
            return false;
        }
    }
    Console.WriteLine("{0} created!", targetPath);
    pdfDoc.Close();
    return true;
}

Related

Why are the images in this PDF file corrupt in some viewers?

We are using PDFsharp to gather sets of images from a folder and put them into PDF files, one image per page. For this certain set of images, the resulting PDF document appears corrupted when opening in certain viewers... Chrome is broken, Adobe Reader is broken, Edge gives up entirely, but Firefox actually renders it correctly.
Here is the relevant code:
public static List<string> ImageExtensions()
{
return new List<string>() { ".tif", ".tiff", ".png", ".jpg", ".jpeg", ".gif" };
}
public void Convert()
{
PdfDocument doc = new PdfDocument();
foreach(string fPath in FilePaths)
{
string ext = Path.GetExtension(fPath);
if (ImageExtensions().Contains(ext))
AddImageToPDF(fPath, ref doc);
}
try
{
doc.Save(OutputFilePath);
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
doc.Close();
doc.Dispose();
}
private void AddImageToPDF(string imagePath, ref PdfDocument doc)
{
Image MyImage = Image.FromFile(imagePath);
AddImageToPDF(MyImage, ref doc);
}
private void AddImageToPDF(Image image, ref PdfDocument doc)
{
try
{
int numPages = doc.Pages.Count;
using (Image MyImage = image)
{
for (int _pageIndex = 0; _pageIndex < MyImage.GetFrameCount(FrameDimension.Page); _pageIndex++)
{
MyImage.SelectActiveFrame(FrameDimension.Page, _pageIndex);
XImage img = XImage.FromGdiPlusImage(MyImage);
img.Interpolate = true;
var page = new PdfPage() { Orientation = img.PixelWidth > img.PixelHeight ? PageOrientation.Landscape : PageOrientation.Portrait };
doc.Pages.Add(page);
using (var xg = XGraphics.FromPdfPage(doc.Pages[_pageIndex + numPages]))
{
xg.DrawImage(img, 0, 0);
}
}
}
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
}
File set A (broken in most viewers). File set B (working in all viewers). Both sets are JPEGs that were converted from PNGs using the exact same mechanism (Irfanview).
Why do all of the viewers except Firefox render A incorrectly, and what can I do to fix that?

For images with just a single frame you should use XImage.FromFile instead of XImage.FromGdiPlusImage to work around a bug somewhere in recent versions of the Windows framework.
XImage.FromFile allows PDFsharp to access the original file while XImage.FromGdiPlusImage must rely on the information provided by the framework - and for certain JPEG images this information is not correct.
Make sure you are using the latest version of PDFsharp. Good questions indicate which version they refer to.
You should get correct PDF files if you run your code under Windows XP (but that ain't an option for production use, of course).

I am an idiot... updating to the latest version of PDFSharp fixed the issue. I had assumed we were already using the latest. >=/

Unable to merge 2 PDFs using MemoryStream

I have a c# class that takes an HTML and converts it to PDF using wkhtmltopdf.
As you will see below, I am generating 3 PDFs - Landscape, Portrait, and combined of the two.
The properties object contains the html as a string, and the argument for landscape/portrait.
System.IO.MemoryStream PDF = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file = new System.IO.FileStream("abc_landscape.pdf", System.IO.FileMode.Create);
PDF.Position = 0;
properties.IsHorizontalOrientation = false;
System.IO.MemoryStream PDF_portrait = new WkHtmlToPdfConverter().GetPdfStream(properties);
System.IO.FileStream file_portrait = new System.IO.FileStream("abc_portrait.pdf", System.IO.FileMode.Create);
PDF_portrait.Position = 0;
System.IO.MemoryStream finalStream = new System.IO.MemoryStream();
PDF.CopyTo(finalStream);
PDF_portrait.CopyTo(finalStream);
System.IO.FileStream file_combined = new System.IO.FileStream("abc_combined.pdf", System.IO.FileMode.Create);
try
{
PDF.WriteTo(file);
PDF.Flush();
PDF_portrait.WriteTo(file_portrait);
PDF_portrait.Flush();
finalStream.WriteTo(file_combined);
finalStream.Flush();
}
catch (Exception)
{
throw;
}
finally
{
PDF.Close();
file.Close();
PDF_portrait.Close();
file_portrait.Close();
finalStream.Close();
file_combined.Close();
}
The PDFs "abc_landscape.pdf" and "abc_portrait.pdf" generate correctly, as expected, but the operation fails when I try to combine the two in a third pdf (abc_combined.pdf).
I am using MemoryStream to perform the merge, and while debugging I can see that finalStream.Length is equal to the sum of the lengths of the previous two PDFs. But when I try to open the PDF, I see the content of just one of the two PDFs.
Additionally, when I try to close "abc_combined.pdf", I am prompted to save it, which does not happen with the other two PDFs.
Below are a few things that I have tried out already, to no avail:
Change CopyTo() to WriteTo()
Merge the same PDF (either Landscape or Portrait one) with itself
In case it is required, below is the elaboration of the GetPdfStream() method.
var htmlStream = new MemoryStream();
var writer = new StreamWriter(htmlStream);
writer.Write(htmlString);
writer.Flush();
htmlStream.Position = 0;
return htmlStream;
Process process = Process.Start(psi);
process.EnableRaisingEvents = true;
try
{
process.Start();
process.BeginErrorReadLine();
var inputTask = Task.Run(() =>
{
htmlStream.CopyTo(process.StandardInput.BaseStream);
process.StandardInput.Close();
});
// Copy the output to a memorystream
MemoryStream pdf = new MemoryStream();
var outputTask = Task.Run(() =>
{
process.StandardOutput.BaseStream.CopyTo(pdf);
});
Task.WaitAll(inputTask, outputTask);
process.WaitForExit();
// Reset memorystream read position
pdf.Position = 0;
return pdf;
}
catch (Exception ex)
{
throw ex;
}
finally
{
process.Dispose();
}

Merging PDFs in C#, or in any other language, is not straightforward without a third-party library.
I assume you want to avoid a library because most free libraries and NuGet packages have limitations and/or cost money for commercial use.
I did some research and found an open-source library called PdfClown, which has a NuGet package and is also available for Java. It is free and without limitations (donate if you like). The library has many features, one of which is merging two or more documents into one.
Here is my example, which takes a folder with multiple PDF files, merges them, and saves the result to the same or another folder. It is also possible to use MemoryStream, but I did not find it necessary in this case.
The code is self-explanatory; the key point is using SerializationModeEnum.Incremental:
public static void MergePdf(string srcPath, string destFile)
{
    // Validate the arguments before touching the file system
    if (string.IsNullOrWhiteSpace(srcPath) || string.IsNullOrWhiteSpace(destFile))
        return;
    var list = Directory.GetFiles(Path.GetFullPath(srcPath));
    // Nothing to merge unless there are at least two files
    if (list.Length <= 1)
        return;
    var files = list.Select(File.ReadAllBytes).ToList();
    using (var dest = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(files[0])))
    {
        var document = dest.Document;
        var builder = new org.pdfclown.tools.PageManager(document);
        // Append the pages of every remaining file to the first document
        foreach (var file in files.Skip(1))
        {
            using (var src = new org.pdfclown.files.File(new org.pdfclown.bytes.Buffer(file)))
            {
                builder.Add(src.Document);
            }
        }
        dest.Save(destFile, SerializationModeEnum.Incremental);
    }
}
To test it:
var srcPath = @"C:\temp\pdf\input";
var destFile = @"c:\temp\pdf\output\merged.pdf";
MergePdf(srcPath, destFile);
Links to my research:
https://csharp-source.net/open-source/pdf-libraries
https://sourceforge.net/projects/clown/
https://www.oipapio.com/question-3526089
Disclaimer: Part of this answer is taken from my personal web site https://itbackyard.com/merge-multiple-pdf-files-to-one-pdf-file-in-c/ with source code on GitHub.

This answer from Stack Overflow (Combine two (or more) PDF's) by Andrew Burns works for me:
using (PdfDocument one = PdfReader.Open("pdf 1.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument two = PdfReader.Open("pdf 2.pdf", PdfDocumentOpenMode.Import))
using (PdfDocument outPdf = new PdfDocument())
{
    CopyPages(one, outPdf);
    CopyPages(two, outPdf);
    outPdf.Save("file1and2.pdf");
}
void CopyPages(PdfDocument from, PdfDocument to)
{
    for (int i = 0; i < from.PageCount; i++)
    {
        to.AddPage(from.Pages[i]);
    }
}

That's not quite how PDFs work. PDFs are structured files in a specific format.
You can't just append the bytes of one to the other and expect the result to be a valid document.
You're going to have to use a library that understands the format and can do the operation for you, or develop your own solution.

PDF files aren't just text and images. Behind the scenes there is a strict file format that describes things like PDF version, the objects contained in the file and where to find them.
In order to merge 2 PDFs you'll need to manipulate the streams.
First you'll need to conserve the header from only one of the files. This is pretty easy since it's just the first line.
Then you can write the body of the first page, and then the second.
Now the hard part, and likely the part that will convince you to use a library, is that you have to rebuild the xref table. The xref table is a cross-reference table that describes the content of the document and, more importantly, where to find each element. You'd have to calculate the byte offset of the second page, shift all of the elements in its xref table by that much, and then add its xref table to the first. You'll also need to ensure you create objects in the xref table for the page break.
Once that's done, you need to re-build the document trailer which tells an application where the various sections of the document are among other things.
See https://resources.infosecinstitute.com/pdf-file-format-basic-structure/
This is not trivial and you'll end up re-writing lots of code that already exists.
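To make the point about the trailer concrete, here is a minimal sketch that inspects raw PDF bytes using only the .NET base class library. It relies on one fact about the format: a reader locates the cross-reference table through the `startxref` keyword near the end of the file, which is followed by a byte offset and then `%%EOF`. The class and method names (`PdfTrailerProbe`, `CountMarker`, `LastStartXref`) are hypothetical, and this is a diagnostic aid, not a parser or a merge routine.

```csharp
using System;
using System.Text;

public static class PdfTrailerProbe
{
    // Counts non-overlapping occurrences of an ASCII marker in a byte array.
    public static int CountMarker(byte[] data, string marker)
    {
        byte[] needle = Encoding.ASCII.GetBytes(marker);
        int count = 0;
        for (int i = 0; i + needle.Length <= data.Length; i++)
        {
            if (Matches(data, i, needle))
            {
                count++;
                i += needle.Length - 1; // skip past this occurrence
            }
        }
        return count;
    }

    // Returns the decimal offset that follows the LAST "startxref" keyword,
    // or -1 when the keyword or the number is missing.
    public static long LastStartXref(byte[] data)
    {
        byte[] needle = Encoding.ASCII.GetBytes("startxref");
        for (int i = data.Length - needle.Length; i >= 0; i--)
        {
            if (!Matches(data, i, needle)) continue;

            int p = i + needle.Length;
            // Skip the whitespace between the keyword and the number.
            while (p < data.Length && (data[p] == ' ' || data[p] == '\r' || data[p] == '\n')) p++;

            long value = 0;
            bool anyDigit = false;
            while (p < data.Length && data[p] >= '0' && data[p] <= '9')
            {
                value = value * 10 + (data[p] - '0');
                anyDigit = true;
                p++;
            }
            return anyDigit ? value : -1;
        }
        return -1;
    }

    private static bool Matches(byte[] data, int at, byte[] needle)
    {
        if (at < 0 || at + needle.Length > data.Length) return false;
        for (int j = 0; j < needle.Length; j++)
        {
            if (data[at + j] != needle[j]) return false;
        }
        return true;
    }
}
```

If two complete PDFs are appended byte for byte, `CountMarker(bytes, "%%EOF")` reports at least two markers, and `LastStartXref` returns an offset that was computed relative to the start of the second file, not the combined one. Note that a count above one is not proof of a bad merge by itself, since incrementally saved PDFs legitimately contain several `%%EOF` markers.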

Export arabic data into pdf asp.net showing error

When I export Arabic data to PDF, Adobe Reader shows an error: it could not open the file because it is either not a supported file type or has been damaged. My ASP.NET C# code follows. Please guide me.
protected void btnExport_Click(object sender, EventArgs e)
{
Response.ContentType = "application/pdf";
Response.AddHeader("content-disposition", "attachment;filename=TestPage.pdf");
Document doc = new Document(PageSize.LETTER);
doc.Open();
//Sample HTML
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.Append(@"<p>This is a test: <strong>مسندم</strong></p>");
//Path to our font
string arialuniTff = Server.MapPath("~/tradbdo.TTF");
//Register the font with iTextSharp
iTextSharp.text.FontFactory.Register(arialuniTff);
//Create a new stylesheet
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
//Set the default body font to our registered font's internal name
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Traditional Arabic Bold");
//Set the default encoding to support Unicode characters
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);
//Parse our HTML using the stylesheet created above
List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()), ST);
//Loop through each element, don't bother wrapping in P tags
foreach (var element in list)
{
doc.Add(element);
}
doc.Close();
Response.Write(doc);
Response.End();
}

I found the following article which shows how to correctly export and display Arabic content via the iTextSharp library: http://geekswithblogs.net/JaydPage/archive/2011/11/02/using-itextsharp-to-correctly-display-hebrew--arabic-text-right.aspx.
Here is the code sample that you can try:
using iTextSharp.text;
using iTextSharp.text.pdf;
using System.Text.RegularExpressions;
using System.IO;
using System.Diagnostics;
public void WriteDocument()
{
    //Declare an iTextSharp document
    Document document = new Document(PageSize.A4);
    //Create our file stream and bind the writer to the document and the stream
    PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(@"C:\Test.Pdf", FileMode.Create));
    //Open the document for writing
    document.Open();
    //Add a new page
    document.NewPage();
    //Reference a Unicode font to be sure that the symbols are present.
    BaseFont bfArialUniCode = BaseFont.CreateFont(@"C:\ARIALUNI.TTF", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
    //Create a font from the base font
    Font font = new Font(bfArialUniCode, 12);
    //Use a table so that we can set the text direction
    PdfPTable table = new PdfPTable(1);
    //Ensure that wrapping is on, otherwise Right to Left text will not display
    table.DefaultCell.NoWrap = false;
    //Create a regex expression to detect Hebrew or Arabic code points
    const string regex_match_arabic_hebrew = @"[\u0600-\u06FF,\u0590-\u05FF]+";
    if (Regex.IsMatch("مسندم", regex_match_arabic_hebrew, RegexOptions.IgnoreCase))
    {
        table.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
    }
    //Create a cell and add text to it
    PdfPCell text = new PdfPCell(new Phrase("مسندم", font));
    //Ensure that wrapping is on, otherwise Right to Left text will not display
    text.NoWrap = false;
    //Add the cell to the table
    table.AddCell(text);
    //Add the table to the document
    document.Add(table);
    //Close the document
    document.Close();
    //Launch the document if you have a file association set for PDFs
    Process AcrobatReader = new Process();
    AcrobatReader.StartInfo.FileName = @"C:\Test.Pdf";
    AcrobatReader.Start();
}

The iTextSharp.text.Document is a class used to help bridge human concepts like Paragraph and Margin into PDF concepts. The bridge part is important. It is not a PDF file in any way so it should never be treated as a PDF. Doing so would be like treating System.Drawing.Graphics as if it were an image. This leads to one of your problems on the second to last line of code that tries to treat the Document as if it were a PDF by sending it directly to the output stream:
//This won't work
Response.Write(doc);
You will find many, many tutorials out there that do this and they are all wrong. Fortunately (or unfortunately), PDF is forgiving and allows junk data at the end, so only a handful of PDFs fail and people assume there was some other problem.
Your other problem is that you are missing a PdfWriter. If Document is the bridge, PdfWriter is the actual construction worker that puts that PDF together. It, however, is also not a PDF. Instead, it needs to be bound to a stream like a file, in-memory or the HttpResponse.OutputStream.
Below is some code that shows this off. I very strongly recommend separating your PDF logic from your ASPX logic. Do all of your PDF stuff first and get an actual "something" that represents a PDF, then do something with it.
At the beginning we declare a byte array that we'll fill in later. Next we create a System.IO.MemoryStream that the PDF will be written to. After creating the Document we then create a PdfWriter that's bound to the Document and our stream. Your internal code is the same and, although I didn't test it, it appears correct. Right before we're done with our MemoryStream we grab the active bytes into our byte array. Lastly we use the BinaryWrite() method to send our raw binary PDF to the requesting client.
//At the end of this, bytes will hold a byte array representing an actual PDF file
Byte[] bytes;
//Create a simple in-memory stream
using (var ms = new MemoryStream()) {
    using (var doc = new Document()) {
        //Create a new PdfWriter bound to our document and the stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {
            doc.Open();
            //This is unchanged from the OP's code
            //Sample HTML
            StringBuilder stringBuilder = new StringBuilder();
            stringBuilder.Append(@"<p>This is a test: <strong>مسندم</strong></p>");
            //Path to our font
            string arialuniTff = Server.MapPath("~/tradbdo.TTF");
            //Register the font with iTextSharp
            iTextSharp.text.FontFactory.Register(arialuniTff);
            //Create a new stylesheet
            iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
            //Set the default body font to our registered font's internal name
            ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Traditional Arabic Bold");
            //Set the default encoding to support Unicode characters
            ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);
            //Parse our HTML using the stylesheet created above
            List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()), ST);
            //Loop through each element, don't bother wrapping in P tags
            foreach (var element in list) {
                doc.Add(element);
            }
            doc.Close();
        }
    }
    //Right before closing the MemoryStream grab all of the active bytes
    bytes = ms.ToArray();
}
//We now have a valid PDF and can do whatever we want with it
//In this case, use BinaryWrite to send it directly to the requesting client
Response.ContentType = "application/pdf";
Response.AddHeader("content-disposition", "attachment;filename=TestPage.pdf");
Response.BinaryWrite(bytes);
Response.End();

iTextSharp won't use font unless it is installed

I have my PDF file being generated correctly with Chinese characters if I have the font installed in my system font directory. When I actually deploy this to the Azure website, I won't be able to install the font.
I added the font to the project and it is getting deployed, the application finds the path, but iTextSharp does not use it when the PDF is generated.
The current code which works -
FontFactory.Register("c:/windows/fonts/ARIALUNI.TTF");
This does not work but the path is good -
string arialuniTffFont = System.IO.Path.Combine(Server.MapPath("~/bin/fonts/arialuni.ttf"));
FontFactory.Register(arialuniTffFont);
UPDATED:
private void CreatePDF(IList<string> HTMLData, string fileName, bool rotate)
{
//Create PDF document
Document doc = new Document(PageSize.A4, 36, 36, 36, 36);
if (rotate)
{
doc.SetPageSize(iTextSharp.text.PageSize.A4.Rotate());
}
HTMLWorker parser = new HTMLWorker(doc);
string fontpath = Server.MapPath("/Fonts/arialuni.ttf");
FontFactory.Register(fontpath);
StyleSheet styles = new StyleSheet();
styles.LoadTagStyle(HtmlTags.TABLE, HtmlTags.SIZE, "6pt");
styles.LoadTagStyle(HtmlTags.H3, HtmlTags.SIZE, "10pt");
styles.LoadTagStyle(HtmlTags.H5, HtmlTags.SIZE, "6pt");
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Arial Unicode MS");
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);
parser.SetStyleSheet(styles);
PdfWriter.GetInstance(doc, new FileStream(fileName, FileMode.Create));
doc.Open();
//Try/Catch removed
foreach (var s in HTMLData) {
StringReader reader = new StringReader(s);
parser.Parse(reader);
doc.NewPage();
}
doc.Close();
}
The entire routine that does not produce Chinese characters.

The file that I was trying to use was the one that was directly downloaded from the internet. I did not realize that it needed to be extracted. Once I figured this out and got the correct file, it worked correctly without any code changes.
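A quick sanity check along these lines can catch that mistake before the path is handed to FontFactory.Register. The sketch below reads the first four bytes of the file and compares them with well-known signatures. It assumes the download was a ZIP archive, which the answer above does not actually state, and the names `FontFileSniffer`, `Classify`, and `ClassifyFile` are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class FontFileSniffer
{
    // Classifies a file by its first four bytes.
    public static string Classify(byte[] head)
    {
        if (head == null || head.Length < 4) return "too short";

        // ZIP local file header: 'P' 'K' 0x03 0x04 (assumed archive type)
        if (head[0] == 0x50 && head[1] == 0x4B && head[2] == 0x03 && head[3] == 0x04)
            return "zip archive";

        // TrueType outlines: sfnt version 0x00010000
        if (head[0] == 0x00 && head[1] == 0x01 && head[2] == 0x00 && head[3] == 0x00)
            return "truetype";

        string tag = Encoding.ASCII.GetString(head, 0, 4);
        if (tag == "OTTO") return "opentype-cff";        // OpenType with CFF outlines
        if (tag == "ttcf") return "truetype collection"; // .ttc container
        if (tag == "true") return "truetype";            // Apple TrueType variant

        return "unknown";
    }

    // Reads the first four bytes of a file and classifies them.
    public static string ClassifyFile(string path)
    {
        byte[] head = new byte[4];
        int read = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            int n;
            while (read < head.Length && (n = fs.Read(head, read, head.Length - read)) > 0)
            {
                read += n;
            }
        }
        return read < head.Length ? "too short" : Classify(head);
    }
}
```

Calling `FontFileSniffer.ClassifyFile(fontpath)` before registering the font and logging anything other than a font type would have pointed at the wrong file without any change to the PDF code.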

How to detect a corrupted PDF file before merging using iTextSharp in C#

I am using iTextSharp to merge pdf pages.
But some of the PDFs might be corrupted.
My question is, how to verify programmatically whether the pdf is corrupted or not?

I usually check the header of a file to see what kind of file it is. A PDF header always starts with %PDF.
Of course, the file could be corrupted AFTER the header; in that case I am not really sure there is any other way than just trying to open and read from the document. When the file is corrupted, opening OR reading from that document will probably throw an exception. I am not sure iTextSharp throws exceptions in every such case, but I think you can test that out.
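The header check can be sketched with nothing but the .NET base class library. The class and method names here (`PdfHeaderCheck`, `StartsWithPdfHeader`) are hypothetical, and the check is deliberately strict: it only looks at the first four bytes, so a file with leading junk before %PDF is rejected even though some viewers tolerate that. Passing the check says nothing about whether the rest of the file is intact.

```csharp
using System;
using System.IO;
using System.Text;

public static class PdfHeaderCheck
{
    // Returns true when the stream begins with the ASCII bytes "%PDF".
    public static bool StartsWithPdfHeader(Stream stream)
    {
        byte[] expected = Encoding.ASCII.GetBytes("%PDF");
        byte[] actual = new byte[expected.Length];

        int read = 0;
        while (read < actual.Length)
        {
            int n = stream.Read(actual, read, actual.Length - read);
            if (n == 0) return false; // stream ended before the full signature
            read += n;
        }

        for (int i = 0; i < expected.Length; i++)
        {
            if (actual[i] != expected[i]) return false;
        }
        return true;
    }

    // Convenience overload for a file on disk.
    public static bool StartsWithPdfHeader(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            return StartsWithPdfHeader(fs);
        }
    }
}
```

A merge routine could call `PdfHeaderCheck.StartsWithPdfHeader(filePath)` as a cheap first filter and skip obvious non-PDF files before handing the rest to the library, which still needs its own exception handling.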

One way, since you're merging files, is to wrap your code in a try...catch block:
Dictionary<string, Exception> errors =
    new Dictionary<string, Exception>();
document.Open();
PdfContentByte cb = writer.DirectContent;
foreach (string filePath in testList) {
    try {
        PdfReader reader = new PdfReader(filePath);
        int pages = reader.NumberOfPages;
        for (int i = 0; i < pages; ) {
            document.NewPage();
            PdfImportedPage page = writer.GetImportedPage(reader, ++i);
            cb.AddTemplate(page, 0, 0);
        }
    }
    // **may** be PDF spec, but not supported by iText
    catch (iTextSharp.text.exceptions.UnsupportedPdfException ue) {
        errors.Add(filePath, ue);
    }
    // invalid according to PDF spec
    catch (iTextSharp.text.exceptions.InvalidPdfException ie) {
        errors.Add(filePath, ie);
    }
    catch (Exception e) {
        errors.Add(filePath, e);
    }
}
if (errors.Keys.Count > 0) {
    document.NewPage();
    foreach (string key in errors.Keys) {
        document.Add(new Paragraph(string.Format(
            "FILE: {0}\nEXCEPTION: [{1}]: {2}",
            key, errors[key].GetType(), errors[key].Message
        )));
    }
}
where testList is a collection of file paths to the PDF documents you're merging.
On a separate note, you also need to consider what you define as corrupt. There are many PDF documents out there that do not meet PDF specs, but some readers (Adobe Reader) are smart enough to fix/repair them on the fly.
