Html to pdf some characters are missing (itextsharp)

Html to pdf some characters are missing (itextsharp) - c#

I want to export gridview to pdf by using the itextsharp library. The problem is that some turkish characters such as İ,ı,Ş,ş etc... are missing in the pdf document. The code used to export the pdf is:
protected void LinkButtonPdf_Click(object sender, EventArgs e)
{
Response.ContentType = "application/pdf";
Response.ContentEncoding = System.Text.Encoding.UTF8;
Response.AddHeader("content-disposition", "attachment;filename=FileName.pdf");
Response.Cache.SetCacheability(HttpCacheability.NoCache);
System.IO.StringWriter stringWrite = new StringWriter();
System.Web.UI.HtmlTextWriter htmlWrite = new HtmlTextWriter(stringWrite);
GridView1.RenderControl(htmlWrite);
StringReader reader = new StringReader(textConvert(stringWrite.ToString()));
Document doc = new Document(PageSize.A4);
HTMLWorker parser = new HTMLWorker(doc);
PdfWriter.GetInstance(doc, Response.OutputStream);
doc.Open();
parser.Parse(reader);
doc.Close();
}
public static string textConvert(string S)
{
if (S == null) { return null; }
try
{
System.Text.Encoding encFrom = System.Text.Encoding.UTF8;
System.Text.Encoding encTo = System.Text.Encoding.UTF8;
string str = S;
Byte[] b = encFrom.GetBytes(str);
return encTo.GetString(b);
}
catch { return null; }
}
Note: when I want to insert characters into the pdf document, the missing characters are shown in it. I insert the characters with this code:
BaseFont bffont = BaseFont.CreateFont("C:\\WINDOWS\\Fonts\\arial.ttf", BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
Font fontozel = new Font(bffont, 12, Font.NORMAL, new Color(0, 0, 0));
doc.Add(new Paragraph("İİııŞŞşşĞĞğğ", fontozel));

Finaly I think I found the solution,I changed itextsharp source code a little in order to show turkish characters.(turkish character code is cp1254)
I add "public const string CP1254 = "Cp1254";" to [BaseFont.cs] in the source code.
After that I modify the [FactoryProperties.cs].I changed like this;
public Font GetFont(ChainedProperties props)
{
I don't write the whole code.I changed only code below;
------------Default itextsharp code------------------------------------------------------
if (encoding == null)
encoding = BaseFont.WINANSI;
return fontImp.GetFont(face, encoding, true, size, style, color);
-------------modified code--------------------------------------------
encoding = BaseFont.CP1254;
return fontImp.GetFont("C:\\WINDOWS\\Fonts\\arial.ttf", encoding, true, size, style, color);
}
.After I compile new dll ,and missing characters are shown.

No need to change the source code.
Try this:
iTextSharp.text.pdf.BaseFont STF_Helvetica_Turkish = iTextSharp.text.pdf.BaseFont.CreateFont("Helvetica","Cp1254", iTextSharp.text.pdf.BaseFont.NOT_EMBEDDED);
iTextSharp.text.Font fontNormal = new iTextSharp.text.Font(STF_Helvetica_Turkish, 12, iTextSharp.text.Font.NORMAL);

thank you very much all who posted the samples..
i use the below solution from codeproject , and there was the turkish char set problems due to font..
If you use htmlworker you should register font and pass to htmlworker
http://www.codeproject.com/Articles/260470/PDF-reporting-using-ASP-NET-MVC3
StyleSheet styles = new iTextSharp.text.html.simpleparser.StyleSheet();
styles.LoadTagStyle("h3", "size", "5");
styles.LoadTagStyle("td", "size", ".6");
FontFactory.Register("c:\\windows\\fonts\\arial.ttf", "Garamond"); // just give a path of arial.ttf
styles.LoadTagStyle("body", "face", "Garamond");
styles.LoadTagStyle("body", "encoding", "Identity-H");
styles.LoadTagStyle("body", "size", "12pt");
using (var htmlViewReader = new StringReader(htmlText))
{
using (var htmlWorker = new HTMLWorker(pdfDocument, null, styles))
{
htmlWorker.Parse(htmlViewReader);
}
}

I am not familiar with the iTextSharp library; however, you seem to be converting the output of your gridview component to a string and reading from that string to construct your PDF document. You also have a strange conversion from UTF-8 to UTF-8 going on.
From what I can see (given that your GridView is outputting characters correctly) if you are outputting the characters to a string they would be represented as UTF-16 in memory. You probably need to pass this string directly into the PDF library (like how you pass the raw UTF-16 .NET string "İııŞŞşşĞĞğğ" as it is).

You can use:
iTextSharp.text.pdf.BaseFont Vn_Helvetica = iTextSharp.text.pdf.BaseFont.CreateFont(#"C:\Windows\Fonts\arial.ttf", "Identity-H", iTextSharp.text.pdf.BaseFont.EMBEDDED);
iTextSharp.text.Font fontNormal = new iTextSharp.text.Font(Vn_Helvetica, 12, iTextSharp.text.Font.NORMAL);

For Turkish encoding
CultureInfo ci = new CultureInfo("tr-TR");
Encoding enc = Encoding.GetEncoding(ci.TextInfo.ANSICodePage);
If you're outputting HTML, try different DOCTYPE tags at the top of the page.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Note if using HTML you may need to HTMLEncode the characters.
Server.HTMLEncode()
HttpServerUtility.HtmlEncode()

BaseFont bF = BaseFont.CreateFont("c:\\arial.ttf","windows-1254",true);
Font f = new Font(bF,12f,Font.NORMAL);
Chunk c = new Chunk();
c.Font = f;
c.Append("Turkish characters: ĞÜŞİÖÇ ğüşıöç");
document.Add(c);
In the first line, you may write these instead of "windows-1254". All works:
Cp1254
iso-8859-9
windows-1254

Don't change the source code of the iTextSharp. Define a new style:
var styles = new StyleSheet();
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.FONTFAMILY, "tahoma");
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, "Identity-H");
and then pass it to the HTMLWorker.ParseToList method.

i have finally find a soultution for this problem , by this you can print all turkish character.
String htmlText = html.ToString();
Document document = new Document();
string filePath = HostingEnvironment.MapPath("~/Content/Pdf/");
PdfWriter.GetInstance(document, new FileStream(filePath + "\\pdf-"+Name+".pdf", FileMode.Create));
document.Open();
iTextSharp.text.html.simpleparser.HTMLWorker hw = new iTextSharp.text.html.simpleparser.HTMLWorker(document);
FontFactory.Register(Path.Combine(_webHelper.MapPath("~/App_Data/Pdf/arial.ttf")), "Garamond"); // just give a path of arial.ttf
StyleSheet css = new StyleSheet();
css.LoadTagStyle("body", "face", "Garamond");
css.LoadTagStyle("body", "encoding", "Identity-H");
css.LoadTagStyle("body", "size", "12pt");
hw.SetStyleSheet(css);
hw.Parse(new StringReader(htmlText));

I strongly suggest not to change itextsharp source code in order to solve this problem. Have a look at my other comment on the subject: https://stackoverflow.com/a/24587745/1138663

I solved the problem. I can provide my the other solution type...
try
{
BaseFont bf = BaseFont.CreateFont("c:\\windows\\fonts\\calibrib.ttf",
BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
Document document = new Document(PageSize.A4, 25, 25, 30, 30);
PdfWriter writer = PdfWriter.GetInstance(document, fs);
Font f = new Font(bf, 12f, Font.NORMAL);
// Open the document to enable you to write to the document
document.Open();
// Add a simple and wellknown phrase to the document
for (int x = 0; x != 100; x++)
{
document.Add(new Paragraph("Paragraph - This is a test! ÇçĞğİıÖöŞşÜü",f));
}
// Close the document
document.Close();
}
catch(Exception)
{
}

Related

getting text from a pdf document itextsharp

i tried using iTextSharp to get the text from a pdf document,
it works great if the pdf file is with english text(latin chars).
If i try to get the text from a pdf doc with cyrillic characters the output is just question marks. Are there some settings to be made, or cyrillic isnt supported?
this is the code for creating the pdf:
string testText = "зззi";
string tmpFile = #"C:\items\test.pdf";
string myFont = #"C:\windows\fonts\verdana.ttf";
iTextSharp.text.Rectangle pgeSize = new iTextSharp.text.Rectangle(595, 792);
iTextSharp.text.Document doc = new iTextSharp.text.Document(pgeSize, 10, 10, 10, 10);
iTextSharp.text.pdf.PdfWriter wrtr;
wrtr = iTextSharp.text.pdf.PdfWriter.GetInstance(doc,
new System.IO.FileStream(tmpFile, System.IO.FileMode.Create));
doc.Open();
doc.NewPage();
iTextSharp.text.pdf.BaseFont bfR;
bfR = BaseFont.CreateFont(myFont, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
iTextSharp.text.BaseColor clrBlack =
new iTextSharp.text.BaseColor(0, 0, 0);
iTextSharp.text.Font fntHead =
new iTextSharp.text.Font(bfR, 34, iTextSharp.text.Font.NORMAL, clrBlack);
iTextSharp.text.Paragraph pgr =
new iTextSharp.text.Paragraph(testText, fntHead);
doc.Add(pgr);
doc.Close();
this is the code for retrieving the text:
PdfReader reader1
= new PdfReader("c:/items/test.pdf");
Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader1, 1, new SimpleTextExtractionStrategy()));
Console.ReadLine();
the output is: ???i
EDIT 2
i managed to read text from the pdf i created, but still cant get the text from a random pdf. How can i check if that pdf provides the required info for text extraction?

itextsharp html to pdf

I want to change some HTML in a pdf. All my html is in HTML string but I don't know how to pass it in correctly within iTextSharp.
public void PDF()
{
// Create a doc object
var doc = new doc(PageSize.A4, 50, 50, 25, 25);
// Create a new PdfWrite object, writing the output to the file ~/PDFTemplate/SimpleFormFieldDemo.pdf
var output = new FileStream(Server.MapPath("t.pdf"), FileMode.Create);
var writer = PdfWriter.GetInstance(doc, output);
// Open the doc for writing
doc.Open();
//Add Wallpaper image to the pdf
var Wallpaper = iTextSharp.text.Image.GetInstance(Server.MapPath("hfc.png"));
Wallpaper.SetAbsolutePosition(0, 0);
Wallpaper.ScaleAbsolute(600, 840);
doc.Add(Wallpaper);
iTextSharp.text.html.simpleparser.HTMLWorker hw = new iTextSharp.text.html.simpleparser.HTMLWorker(doc);
StyleSheet css = new StyleSheet();
css.LoadTagStyle("body", "face", "Garamond");
css.LoadTagStyle("body", "encoding", "Identity-H");
css.LoadTagStyle("body", "size", "12pt");
hw.Parse(new StringReader(HTML));
doc.Close();
Response.Redirect("t.pdf");
}
If anyone knows how to make this work.. it be good.
Thanks
Dom

Please download The Best iText Questions on StackOverflow. It's a free ebook, you'll benefit from it.
Once you have downloaded is, go to the section entitled "Parsing XML and XHTML".
Allow me to quote from the answer to this question: RowSpan does not work in iTextSharp?
You are using HTMLWorker instead of XML Worker, and you are right:
HTMLWorker has no support for CSS. Saying CSS doesn't work in
iTextSharp is wrong. It doesn't work when you use HTMLWorker, but
that's documented: the CSS you need works in XML Worker.
Please throw away your code, and start anew using XML Worker.
There are many examples (simple ones as well as complex ones) in the book. Let me give you only one:
using (var fsOut = new FileStream(outputFile, FileMode.Create, FileAccess.Write))
using (var stringReader = new StringReader(result))
{
var document = new Document();
var pdfWriter = PdfWriter.GetInstance(document, fsOut);
pdfWriter.InitialLeading = 12.5f;
document.Open();
var xmlWorkerHelper = XMLWorkerHelper.GetInstance();
var cssResolver = new StyleAttrCSSResolver();
var xmlWorkerFontProvider = new XMLWorkerFontProvider();
foreach (string font in fonts)
{
xmlWorkerFontProvider.Register(font);
}
var cssAppliers = new CssAppliersImpl(xmlWorkerFontProvider);
var htmlContext = new HtmlPipelineContext(cssAppliers);
htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
PdfWriterPipeline pdfWriterPipeline = new PdfWriterPipeline(document, pdfWriter);
HtmlPipeline htmlPipeline = new HtmlPipeline(htmlContext, pdfWriterPipeline);
CssResolverPipeline cssResolverPipeline = new CssResolverPipeline(cssResolver, htmlPipeline);
XMLWorker xmlWorker = new XMLWorker(cssResolverPipeline, true);
XMLParser xmlParser = new XMLParser(xmlWorker);
xmlParser.Parse(stringReader);
document.Close();
}
}
(Source: iTextSharp XmlWorker: right-to-left)
If you want an easier example, take a look at the answers of these questions:
How to parse multiple HTML files into a single PDF?
How to add a rich Textbox (HTML) to a table cell?
...
The code that parses an HTML string and a CSS string to a list of iText(Sharp) elements is as simple as this:
ElementList list = XMLWorkerHelper.parseToElementList(html, css);
You can find more examples on the official iText web site.

ITextsharp not displaying ARABIC data when converting html table data to PDF

I am making a report.following is a code and sample.I am using html table for reports.When I run the code pdf is successfully generated but Arabic is not showing.Can you guide me how can i embed Arabic in it.Can you modify my following code which shows arabic data.
Response.ContentType = "application/pdf";
Response.AddHeader("content-disposition", "attachment;filename=TestPage.pdf");
Response.Cache.SetCacheability(HttpCacheability.NoCache);
StringWriter sw = new StringWriter();
HtmlTextWriter hw = new HtmlTextWriter(sw);
tblid1.RenderControl(hw);
StringReader sr = new StringReader(sw.ToString());
Document pdfDoc = new Document(PageSize.A4, 80f, 80f, -2f, 35f);
HTMLWorker htmlparser = new HTMLWorker(pdfDoc);
PdfWriter writer = PdfWriter.GetInstance(pdfDoc, Response.OutputStream);
pdfDoc.Open();
htmlparser.Parse(sr);
pdfDoc.Close();
Response.Write(pdfDoc);
Response.End();
<table id="tblid1" runat="server">
<tr>
<td>سلطانالخارج</td>
<td>مسندم</td>
</tr>
</table>

You would need to embed a font into your pdf that supports arabic glyphs.
string fontpath = Environment.GetEnvironmentVariable( "SystemRoot" ) + "\\fonts\\arabtype.ttf";
BaseFont basefont = BaseFont.CreateFont( fontpath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED );
Font arabicFont = new Font( basefont, 10f, Font.NORMAL );
Answer found in thread: Itextsharp and arabic character?
EDIT: This is how I would do it based on the examples I could find and what you're trying to accomplish:
using(WebClient client = new WebClient()) {
string htmlString = client.DownloadString(url);
}
FontFactory.Register("c:/windows/fonts/arabtype.TTF");
StyleSheet style = new StyleSheet();
style.LoadTagStyle("body", "face", "%NAME OF ARABIC FONT%");
style.LoadTagStyle("body", "encoding", BaseFont.IDENTITY_H);
using (Document document = new Document(PageSize.A4, 80f, 80f, -2f, 35f)) {
PdfWriter writer = PdfWriter.GetInstance(
document, Response.OutputStream
);
document.Open();
foreach(IElement element in HTMLWorker.ParseToList(
new StringReader(htmlString), style))
{
document.Add(element);
}
}
Note that you would need to ensure you are registering the correct TTF file that contains the encoding for arabic characters and that you would need to replace %NAME OF ARABIC FONT% with the name of the font you're using, and replace %VARIABLE CONTAINING YOUR RAW HTML% with the actual HTML.

var arialFontPath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
FontFactory.Register(arialFontPath);
BaseFont bf = BaseFont.CreateFont(arialFontPath, BaseFont.IDENTITY_H, true);
iTextSharp.text.Font FontAr = new iTextSharp.text.Font(bf);
iTextSharp.text.FontFactory.RegisterDirectory(arialFontPath);
StyleSheet styles = new StyleSheet();
styles.LoadTagStyle(HtmlTags.DIV, HtmlTags.FONTSIZE, "16");
styles.LoadTagStyle(HtmlTags.DIV, HtmlTags.COLOR, "navy");
styles.LoadTagStyle(HtmlTags.DIV, HtmlTags.FONTWEIGHT, "bold");
styles.LoadTagStyle(HtmlTags.P, HtmlTags.INDENT, "30px");
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Arial Unicode MS");
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);
List<IElement> htmlarraylist = HTMLWorker.ParseToList(new StringReader(htmlText), styles);
for (int k = 0; k < htmlarraylist.Count; k++)
{
pdfDocument.Add((IElement)htmlarraylist[k]);
}
this piece of code worked for me, arabic language works perfect in this,
just pass your html text to my htmltext Variable..
Hopefully it will work fine.

Html to PDf conversion unicode character rendering as empty

I am converting some html into pdf using itext sharp. First i have filled out some html string into String Writer then using below mentioned code to converty byte array into pdf
Problem is unicode character [arabic in specific] is rendering empty.
My code is
var sw = new StringWriter();
sw = GetHtmlContent();// here i fetch html
byte[] data;
using (var sr = new StringReader(sw.ToString()))
{
using (var ms = new MemoryStream())
{
using (var pdfDoc = new Document())
{
//Bind a parser to our PDF document
using (var htmlparser = new HTMLWorker(pdfDoc))
{
//Bind the writer to our document and our final stream
using (var w = PdfWriter.GetInstance(pdfDoc, ms))
{
pdfDoc.Open();
//Parse the HTML directly into the document
htmlparser.Parse(sr);
pdfDoc.Close();
//Grab the bytes from the stream before closing it
data = ms.ToArray();
}
}
}
}
}
Response.Buffer = false;
Response.Clear();
Response.ClearContent();
Response.ClearHeaders();
Response.ContentType = "application/pdf";
Response.AddHeader("Content-Disposition", "attachment; filename=Test.pdf");
Response.BinaryWrite(data);
Response.End();
Please help me what's wrong in it

Check below steps to display unicode characters in converting Html to Pdf
Create a HTMLWorker
Register a unicode font and assign it
Create a style sheet and set the encoding to Identity-H
Assign the style sheet to the html parser
Check below code
TextReader reader = new StringReader(html);
Document document = new Document(PageSize.A4, 30, 30, 30, 30);
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(FileName, FileMode.Create));
HTMLWorker worker = new HTMLWorker(document);
document.Open();
FontFactory.Register("C:\\Windows\\Fonts\\ARIALUNI.TTF", "arial unicode ms");
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
ST.LoadTagStyle("body", "encoding", "Identity-H");
worker.Style = ST;
worker.StartDocument();
Check below link for more understanding....
Display Unicode characters in converting Html to Pdf
Hindi, Turkish, and special characters are also display during converting from HTML to PDF using this method. Check below demo image.

Display Unicode characters in converting Html to Pdf

I am using itextsharp dll to convert HTML to PDF.
The HTML has some Unicode characters like α, β... when I try to convert HTML to PDF, Unicode characters are not shown in PDF.
My function:
Document doc = new Document(PageSize.LETTER);
using (FileStream fs = new FileStream(Path.Combine("Test.pdf"), FileMode.Create, FileAccess.Write, FileShare.Read))
{
PdfWriter.GetInstance(doc, fs);
doc.Open();
doc.NewPage();
string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts),
"ARIALUNI.TTF");
BaseFont bf = BaseFont.CreateFont(arialuniTff, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font fontNormal = new Font(bf, 12, Font.NORMAL);
List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()),
new StyleSheet());
Paragraph p = new Paragraph {Font = fontNormal};
foreach (var element in list)
{
p.Add(element);
doc.Add(p);
}
doc.Close();
}

You can also use the new XMLWorkerHelper (from library itextsharp.xmlworker), you need to override the default FontFactory implementation however.
void GeneratePdfFromHtml()
{
const string outputFilename = #".\Files\report.pdf";
const string inputFilename = #".\Files\report.html";
using (var input = new FileStream(inputFilename, FileMode.Open))
using (var output = new FileStream(outputFilename, FileMode.Create))
{
CreatePdf(input, output);
}
}
void CreatePdf(Stream htmlInput, Stream pdfOutput)
{
using (var document = new Document(PageSize.A4, 30, 30, 30, 30))
{
var writer = PdfWriter.GetInstance(document, pdfOutput);
var worker = XMLWorkerHelper.GetInstance();
document.Open();
worker.ParseXHtml(writer, document, htmlInput, null, Encoding.UTF8, new UnicodeFontFactory());
document.Close();
}
}
public class UnicodeFontFactory : FontFactoryImp
{
private static readonly string FontPath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts),
"arialuni.ttf");
private readonly BaseFont _baseFont;
public UnicodeFontFactory()
{
_baseFont = BaseFont.CreateFont(FontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
}
public override Font GetFont(string fontname, string encoding, bool embedded, float size, int style, BaseColor color,
bool cached)
{
return new Font(_baseFont, size, style, color);
}
}

When dealing with Unicode characters and iTextSharp there's a couple of things you need to take care of. The first one you did already and that's getting a font that supports your characters. The second thing is that you want to actually register the font with iTextSharp so that its aware of it.
//Path to our font
string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
//Register the font with iTextSharp
iTextSharp.text.FontFactory.Register(arialuniTff);
Now that we have a font we need to create a StyleSheet object that tells iTextSharp when and how to use it.
//Create a new stylesheet
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
//Set the default body font to our registered font's internal name
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Arial Unicode MS");
The one non-HTML part that you also need to do is set a special encoding parameter. This encoding is specific to iTextSharp and in your case you want it to be Identity-H. If you don't set this then it default to Cp1252 (WINANSI).
//Set the default encoding to support Unicode characters
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);
Lastly, we need to pass our stylesheet to the ParseToList method:
//Parse our HTML using the stylesheet created above
List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()), ST);
Putting that all together, from open to close you'd have:
doc.Open();
//Sample HTML
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.Append(#"<p>This is a test: <strong>α,β</strong></p>");
//Path to our font
string arialuniTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
//Register the font with iTextSharp
iTextSharp.text.FontFactory.Register(arialuniTff);
//Create a new stylesheet
iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
//Set the default body font to our registered font's internal name
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.FACE, "Arial Unicode MS");
//Set the default encoding to support Unicode characters
ST.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, BaseFont.IDENTITY_H);
//Parse our HTML using the stylesheet created above
List<IElement> list = HTMLWorker.ParseToList(new StringReader(stringBuilder.ToString()), ST);
//Loop through each element, don't bother wrapping in P tags
foreach (var element in list) {
doc.Add(element);
}
doc.Close();
EDIT
In your comment you show HTML that specifies an override font. iTextSharp does not spider the system for fonts and its HTML parser doesn't use font fallback techniques. Any fonts specified in HTML/CSS must be manually registered.
string lucidaTff = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "l_10646.ttf");
iTextSharp.text.FontFactory.Register(lucidaTff);

private class UnicodeFontFactory : FontFactoryImp
{
private BaseFont _baseFont;
public UnicodeFontFactory()
{
string FontPath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "arialuni.ttf");
_baseFont = BaseFont.CreateFont(FontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
}
public override Font GetFont(string fontname, string encoding, bool embedded, float size, int style, BaseColor color, bool cached)
{
return new Font(_baseFont, size, style, color);
}
}
//and Code
FontFactory.FontImp = new UnicodeFontFactory();
string convertedHtml = string.Empty;
foreach (char c in htmlText)
{
if (c < 127)
convertedHtml += c;
else
convertedHtml += "&#" + (int)c + ";";
}
List<IElement> htmlElements = XMLWorkerHelper.ParseToElementList(convertedHtml, null);
// add the IElements to the document
foreach (IElement htmlElement in htmlElements)
{
document.Add(htmlElement);
}

This has to be one of the most difficult problems that I've had to figure out to date. The answers on the web, including stack overflow has either poor or outdated information. The answer from Gregor is very close. I wanted to give back to this community because I spent many hours to get to this answer.
Here's a very simple program I wrote in c# as an example for my own notes.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;
namespace ExampleOfExportingPDF
{
class Program
{
static void Main(string[] args)
{
//Build HTML document
StringBuilder sb = new StringBuilder();
sb.Append("<body>");
sb.Append("<h1 style=\"text-align:center;\">これは日本語のテキストの例です。</h1>");
sb.Append("</body>");
//Create our document object
Document Doc = new Document(PageSize.A4);
//Create our file stream
using (FileStream fs = new FileStream(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Test.pdf"), FileMode.Create, FileAccess.Write, FileShare.Read))
{
//Bind PDF writer to document and stream
PdfWriter writer = PdfWriter.GetInstance(Doc, fs);
//Open document for writing
Doc.Open();
//Add a page
Doc.NewPage();
MemoryStream msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(sb.ToString()));
XMLWorkerHelper.GetInstance().ParseXHtml(writer, Doc, msHtml, null, Encoding.UTF8, new UnicodeFontFactory());
//Close the PDF
Doc.Close();
}
}
public class UnicodeFontFactory : FontFactoryImp
{
private static readonly string FontPath = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts),
"arialuni.ttf");
private readonly BaseFont _baseFont;
public UnicodeFontFactory()
{
_baseFont = BaseFont.CreateFont(FontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
}
public override Font GetFont(string fontname, string encoding, bool embedded, float size, int style, BaseColor color,
bool cached)
{
return new Font(_baseFont, size, style, color);
}
}
}
}
Hopefully this will save someone some time in the future.

Here is the few steps to display unicode characters in converting Html to Pdf
Create a HTMLWorker
Register a unicode font and assign it
Create a style sheet and set the encoding to Identity-H
Assign the style sheet to the html parser
Check below link for more understanding....
Display Unicode characters in converting Html to Pdf
Hindi, Turkish, and special characters are also display during converting from HTML to PDF using this method. Check below demo image.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Html to pdf some characters are missing (itextsharp) - c#

You can use: iTextSharp.text.pdf.BaseFont Vn_Helvetica = iTextSharp.text.pdf.BaseFont.CreateFont(#"C:\Windows\Fonts\arial.ttf", "Identity-H", iTextSharp.text.pdf.BaseFont.EMBEDDED); iTextSharp.text.Font fontNormal = new iTextSharp.text.Font(Vn_Helvetica, 12, iTextSharp.text.Font.NORMAL);

Don't change the source code of the iTextSharp. Define a new style: var styles = new StyleSheet(); styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.FONTFAMILY, "tahoma"); styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, "Identity-H"); and then pass it to the HTMLWorker.ParseToList method.

I strongly suggest not to change itextsharp source code in order to solve this problem. Have a look at my other comment on the subject: https://stackoverflow.com/a/24587745/1138663

Related

getting text from a pdf document itextsharp

itextsharp html to pdf

ITextsharp not displaying ARABIC data when converting html table data to PDF

Html to PDf conversion unicode character rendering as empty

Display Unicode characters in converting Html to Pdf

Categories

Resources