Get paragraph text from a PDF using iTextSharp (C#)

Is there any logic to get paragraph text from a PDF file using iTextSharp? I know that PDF only supports runs of text, that it is hard to determine which runs belong to which paragraph, and that there are no <p> tags or other tags marking paragraphs in a PDF. I have tried to get the coordinates of the text runs and build paragraphs from them, but with no luck.
My code snippet is here:
private StringBuilder result = new StringBuilder();
private Vector lastBaseLine;
//to store run of texts
public List<string> strings = new List<String>();
//to store run of texts Coordinate (Y coordinate)
public List<float> baselines = new List<float>();
public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
{
Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
if ((this.lastBaseLine != null) && (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]))
{
if ((!string.IsNullOrEmpty(this.result.ToString())))
{
this.baselines.Add(this.lastBaseLine[Vector.I2]);
this.strings.Add(this.result.ToString());
}
result = new StringBuilder();
}
this.result.Append(renderInfo.GetText());
this.lastBaseLine = curBaseline;
}
Does anybody have any logic related to this issue?
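One way to carry the coordinate idea further, offered as a sketch rather than a tested solution: once each line's baseline Y is known (as collected in baselines and strings above), compare the gap between consecutive baselines with the typical line spacing, and start a new paragraph when the gap is noticeably larger. The ParagraphGrouper class, its Group method, and the 1.5 factor are hypothetical names and values chosen for illustration; the code uses no iTextSharp calls. PDFs with columns, tables, or mixed font sizes would need more logic than this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParagraphGrouper
{
    // Groups lines (ordered top to bottom) into paragraphs.
    // baselines[i] is the Y coordinate of lines[i].
    // gapFactor is an assumed tuning value: a gap larger than gapFactor times
    // the median line gap is treated as a paragraph break.
    public static List<string> Group(IList<float> baselines, IList<string> lines, float gapFactor = 1.5f)
    {
        if (baselines.Count != lines.Count)
            throw new ArgumentException("baselines and lines must have the same length");

        var paragraphs = new List<string>();
        if (lines.Count == 0) return paragraphs;

        // Vertical distance between each pair of consecutive lines.
        var gaps = new List<float>();
        for (int i = 1; i < baselines.Count; i++)
            gaps.Add(Math.Abs(baselines[i - 1] - baselines[i]));

        // The (lower) median gap approximates normal spacing inside a paragraph.
        float typical = 0f;
        if (gaps.Count > 0)
        {
            var sorted = gaps.OrderBy(g => g).ToList();
            typical = sorted[(sorted.Count - 1) / 2];
        }

        var current = new List<string> { lines[0] };
        for (int i = 1; i < lines.Count; i++)
        {
            // A clearly larger gap than usual closes the current paragraph.
            if (gaps[i - 1] > typical * gapFactor)
            {
                paragraphs.Add(string.Join(" ", current));
                current = new List<string>();
            }
            current.Add(lines[i]);
        }
        paragraphs.Add(string.Join(" ", current));
        return paragraphs;
    }
}
```

It could be called as ParagraphGrouper.Group(baselines, strings) after the page has been processed. Note that the listener above only stores a line once the next line begins, so the final line would have to be added to both lists before grouping.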

using (MemoryStream ms = new MemoryStream())
{
Document document = new Document(PageSize.A4, 25, 25, 30, 30);
PdfWriter writer = PdfWriter.GetInstance(document, ms);
document.Open();
document.Add(new Paragraph("Hello World"));
document.Close();
writer.Close();
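Response.ContentType = "application/pdf";
Response.AddHeader("content-disposition",
"attachment;filename=First PDF document.pdf");
//ToArray() returns only the bytes written; GetBuffer() can include unused trailing capacity
byte[] pdfBytes = ms.ToArray();
Response.OutputStream.Write(pdfBytes, 0, pdfBytes.Length);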
}
Here is a sample that may help you with this.
It may not be exactly what you are looking for, but it may help.

Related

Watermark a pdf document with itext 7 instead of iTextSharp

This was my code for itextsharp which worked ok. It displayed "Quote Only" in the middle of each page in a pdf file.
iTextSharp.text.Image img = iTextSharp.text.Image.GetInstance(Server.MapPath(@"~\Content\WaterMarkQuoteOnly.png"));
PdfReader readerOriginalDoc = new PdfReader(File(all, "application/pdf").FileContents);
int n = readerOriginalDoc.NumberOfPages;
img.SetAbsolutePosition(0, 300);
PdfGState _state = new PdfGState()
{
FillOpacity = 0.1F,
StrokeOpacity = 0.1F
};
using (MemoryStream ms = new MemoryStream())
{
using (PdfStamper stamper = new PdfStamper(readerOriginalDoc, ms, '\0', true))
{
for (int i = 1; i <= n; i++)
{
PdfContentByte content = stamper.GetOverContent(i);
content.SaveState();
content.SetGState(_state);
content.AddImage(img);
content.RestoreState();
}
}
//return ms.ToArray();
all = ms.GetBuffer();
}
This is my new iText 7 code. It also displays the watermark, but the position is wrong. I was dismayed to see that you can't add an Image to the canvas; you have to add ImageData, while the position is set on the Image. The image is also much smaller and back to front.
var imagePath = Server.MapPath(@"~\Content\WaterMarkQuoteOnly.png");
var tranState = new iText.Kernel.Pdf.Extgstate.PdfExtGState();
tranState.SetFillOpacity(0.1f);
tranState.SetStrokeOpacity(0.1f);
ImageData myImageData = ImageDataFactory.Create(imagePath, false);
Image img = new Image(myImageData);
img.SetFixedPosition(0, 300);
var reader = new PdfReader(new MemoryStream(all));
var doc = new PdfDocument(reader);
int pages = doc.GetNumberOfPages();
using (var ms = new MemoryStream())
{
var writer = new PdfWriter(ms);
var newdoc = new PdfDocument(writer);
for (int i = 1; i <= pages; i++)
{
//get existing page
PdfPage page = doc.GetPage(i);
//copy page to new document
newdoc.AddPage(page.CopyTo(newdoc));
//get our new page
PdfPage newpage = newdoc.GetPage(i);
Rectangle pageSize = newpage.GetPageSize();
//get canvas based on new page
var canvas = new PdfCanvas(newpage);
//write image data to new page
canvas.SaveState().SetExtGState(tranState);
canvas.AddImage(myImageData, pageSize, true);
canvas.RestoreState();
}
newdoc.Close();
all = ms.GetBuffer();
ms.Flush();
}

You are doing something strange with the PdfDocument objects, and you are also using the wrong AddImage() method.
I am not a C# developer, so I rewrote your example in Java. I took a PDF file and an image, then added the image to the PDF file using transparency. The code to do this was really simple:
public void createPdf(String src, String dest) throws IOException {
PdfExtGState tranState = new PdfExtGState();
tranState.setFillOpacity(0.1f);
ImageData img = ImageDataFactory.create(IMG);
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdf = new PdfDocument(reader, writer);
for (int i = 1; i <= pdf.getNumberOfPages(); i++) {
PdfPage page = pdf.getPage(i);
PdfCanvas canvas = new PdfCanvas(page);
canvas.saveState().setExtGState(tranState);
canvas.addImage(img, 36, 600, false);
canvas.restoreState();
}
pdf.close();
}
For some reason, you created two PdfDocument instances. This isn't necessary. You also used the AddImage() method passing a Rectangle which resizes the image. Also make sure that you don't add the image as an inline image, because that bloats the file size.
I don't know which programming language you are using. For instance: I am not used to variables that are created using var such as var tranState. It should be very easy for you to adapt my Java code though. It's just a matter of changing lowercases into uppercases.

Adding Page Navigation links to PDF document using itextsharp [duplicate]

I have written some code that merges multiple PDFs into a single PDF that I then display from the MemoryStream. This works great. What I need to do is add a table of contents to the end of the file with links to the start of each of the individual PDFs. I planned on doing this using the GotoLocalPage action, which has an option for page numbers, but it doesn't seem to work. If I change the action in the code below to one of the preset ones like PdfAction.FIRSTPAGE, it works fine. Does this not work because I am using the PdfCopy object for the writer parameter of GotoLocalPage?
Document mergedDoc = new Document();
MemoryStream ms = new MemoryStream();
PdfCopy copy = new PdfCopy(mergedDoc, ms);
mergedDoc.Open();
MemoryStream tocMS = new MemoryStream();
Document tocDoc = null;
PdfWriter tocWriter = null;
for (int i = 0; i < filesToMerge.Length; i++)
{
string filename = filesToMerge[i];
PdfReader reader = new PdfReader(filename);
copy.AddDocument(reader);
// Initialise TOC document based off first file
if (i == 0)
{
tocDoc = new Document(reader.GetPageSizeWithRotation(1));
tocWriter = PdfWriter.GetInstance(tocDoc, tocMS);
tocDoc.Open();
}
// Create link for TOC, added random number of 3 for now
Chunk link = new Chunk(filename);
PdfAction action = PdfAction.GotoLocalPage(3, new PdfDestination(PdfDestination.FIT), copy);
link.SetAction(action);
tocDoc.Add(new Paragraph(link));
}
// Add TOC to end of merged PDF
tocDoc.Close();
PdfReader tocReader = new PdfReader(tocMS.ToArray());
copy.AddDocument(tocReader);
copy.Close();
displayPDF(ms.ToArray());
I guess an alternative would be to link to a named element (instead of page number) but I can't see how to add an 'invisible' element to the start of each file before adding to the merged document?

I would just go with two passes. In your first pass, do the merge as you are but also record the filename and page number it should link to. In your second pass, use a PdfStamper which will give you access to a ColumnText that you can use general abstractions like Paragraph in. Below is a sample that shows this off:
Since I don't have your documents, the below code creates 10 documents with a random number of pages each just for testing purposes. (You obviously don't need to do this part.) It also creates a simple dictionary with a fake file name as the key and the raw bytes from the PDF as a value. You have a true file collection to work with but you should be able to adapt that part.
//Create a bunch of files, nothing special here
//files will be a dictionary of names and the raw PDF bytes
Dictionary<string, byte[]> Files = new Dictionary<string, byte[]>();
var r = new Random();
for (var i = 1; i <= 10; i++) {
using (var ms = new MemoryStream()) {
using (var doc = new Document()) {
using (var writer = PdfWriter.GetInstance(doc, ms)) {
doc.Open();
//Create a random number of pages
for (var j = 1; j <= r.Next(1, 5); j++) {
doc.NewPage();
doc.Add(new Paragraph(String.Format("Hello from document {0} page {1}", i, j)));
}
doc.Close();
}
}
Files.Add("File " + i.ToString(), ms.ToArray());
}
}
This next block merges the PDFs. This is mostly the same as your code except that instead of writing a TOC here I'm just keeping track of what I want to write in the future. Where I'm using file.Value you'd use your full file path and where I'm using file.Key you'd use your file's name instead.
//Dictionary of file names (for display purposes) and their page numbers
var pages = new Dictionary<string, int>();
//PDFs start at page 1
var lastPageNumber = 1;
//Will hold the final merged PDF bytes
byte[] mergedBytes;
//Most everything else below is standard
using (var ms = new MemoryStream()) {
using (var document = new Document()) {
using (var writer = new PdfCopy(document, ms)) {
document.Open();
foreach (var file in Files) {
//Add the current page at the previous page number
pages.Add(file.Key, lastPageNumber);
using (var reader = new PdfReader(file.Value)) {
writer.AddDocument(reader);
//Increment our current page index
lastPageNumber += reader.NumberOfPages;
}
}
}
}
mergedBytes = ms.ToArray();
}
This last block actually writes the TOC. If we use a PdfStamper we can create a ColumnText which allows us to use Paragraphs
//Will hold the final PDF
byte[] finalBytes;
using (var ms = new MemoryStream()) {
using (var reader = new PdfReader(mergedBytes)) {
using (var stamper = new PdfStamper(reader, ms)) {
//The page number to insert our TOC into
var tocPageNum = reader.NumberOfPages + 1;
//Arbitrarily pick one page to use as the size of the PDF
//Additional logic could be added or this could just be set to something like PageSize.LETTER
var tocPageSize = reader.GetPageSize(1);
//Arbitrary margin for the page
var tocMargin = 20;
//Create our new page
stamper.InsertPage(tocPageNum, tocPageSize);
//Create a ColumnText object so that we can use abstractions like Paragraph
var ct = new ColumnText(stamper.GetOverContent(tocPageNum));
//Set the working area
ct.SetSimpleColumn(tocPageSize.GetLeft(tocMargin), tocPageSize.GetBottom(tocMargin), tocPageSize.GetRight(tocMargin), tocPageSize.GetTop(tocMargin));
//Loop through each page
foreach (var page in pages) {
var link = new Chunk(page.Key);
var action = PdfAction.GotoLocalPage(page.Value, new PdfDestination(PdfDestination.FIT), stamper.Writer);
link.SetAction(action);
ct.AddElement(new Paragraph(link));
}
ct.Go();
}
}
finalBytes = ms.ToArray();
}

Remove Javascript from PDF using iTextSharp

This seems like something that should be quick to do, but in practice there seems to be a problem. I have a bunch of PDF forms that include form fields and embedded javascript. I would like to remove the javascript code safely, but leave the PDF form fields intact.
So far I've been able to find lots of solutions, but all the solutions have either eliminated both the javascript and the form fields, or left both intact.
Here's solution A; it copies both form fields and javascript:
var pdfReader = new PdfReader(infilename);
using (MemoryStream memoryStream = new MemoryStream()) {
PdfCopyFields copy = new PdfCopyFields(memoryStream);
copy.AddDocument(pdfReader);
copy.Close();
File.WriteAllBytes(rawfilename, memoryStream.ToArray());
}
Alternately, I have solution B, that strips out both form fields and javascript:
Document document = new Document();
using (MemoryStream memoryStream = new MemoryStream()) {
PdfWriter writer = PdfWriter.GetInstance(document, memoryStream);
document.Open();
document.AddDocListener(writer);
for (int p = 1; p <= pdfReader.NumberOfPages; p++) {
document.SetPageSize(pdfReader.GetPageSize(p));
document.NewPage();
PdfContentByte cb = writer.DirectContent;
PdfImportedPage pageImport = writer.GetImportedPage(pdfReader, p);
int rot = pdfReader.GetPageRotation(p);
if (rot == 90 || rot == 270) {
cb.AddTemplate(pageImport, 0, -1.0F, 1.0F, 0, 0, pdfReader.GetPageSizeWithRotation(p).Height);
} else {
cb.AddTemplate(pageImport, 1.0F, 0, 0, 1.0F, 0, 0);
}
}
document.Close();
File.WriteAllBytes(rawfile, memoryStream.ToArray());
}
Does anyone know how to modify either solution A or B to eliminate the javascript but leave the form fields in place?
EDIT: Solution code is here!
using (MemoryStream memoryStream = new MemoryStream()) {
PdfStamper stamper = new PdfStamper(pdfReader, memoryStream);
for (int i = 0; i <= pdfReader.XrefSize; i++) {
object o = pdfReader.GetPdfObject(i);
PdfDictionary pd = o as PdfDictionary;
if (pd != null) {
pd.Remove(PdfName.AA);
pd.Remove(PdfName.JS);
pd.Remove(PdfName.JAVASCRIPT);
}
}
stamper.Close();
pdfReader.Close();
File.WriteAllBytes(rawfile, memoryStream.ToArray());
}

To manipulate a single PDF you should use the class PdfStamper and manipulate its contents, in your case iterating over the existing form fields and removing the JavaScript entries.
The iTextSharp sample AddJavaScriptToForm.cs corresponding to AddJavaScriptToForm.java from chapter 13 of iText in Action — 2nd Edition shows how JavaScript actions are added to fields, the central code being:
PdfStamper stamper = new PdfStamper(reader, ms);
AcroFields form = stamper.AcroFields;
AcroFields.Item fd = form.GetFieldItem("married");
PdfDictionary dictYes = (PdfDictionary) PdfReader.GetPdfObject(fd.GetWidgetRef(0));
PdfDictionary yesAction = ...;
dictYes.Put(PdfName.AA, yesAction);
Thus, to remove such JavaScript form field actions you have to iterate over all those PDF form fields and remove the /AA values in the associated dictionaries:
dictXXX.Remove(PdfName.AA);
EDIT: (provided by Ted Spence) Here is the final code that successfully removes javascript while leaving all form fields intact:
using (MemoryStream memoryStream = new MemoryStream())
{
PdfStamper stamper = new PdfStamper(pdfReader, memoryStream);
for (int i = 0; i <= pdfReader.XrefSize; i++)
{
PdfDictionary pd = pdfReader.GetPdfObject(i) as PdfDictionary;
if (pd != null)
{
pd.Remove(PdfName.AA); // Removes automatic execution objects
pd.Remove(PdfName.JS); // Removes javascript objects
pd.Remove(PdfName.JAVASCRIPT); // Removes other javascript objects
}
}
stamper.Close();
pdfReader.Close();
File.WriteAllBytes(rawfile, memoryStream.ToArray());
}
EDIT: (by mkl) The solution above is somewhat overachieving because it touches each and every indirect dictionary object. On the other hand it ignores inline dictionaries (I haven't checked the spec, though; maybe all /AA, /JS, and /JAVASCRIPT entries appear only in dictionaries which have to be indirect objects, or at least are de-referenced by this code).
If fulfilling this task was my job, I would try and access the objects possibly carrying JavaScript more specifically.
The advantage of this overachieving procedure might be, though, that even PDF objects are inspected which currently are not specified as carrying JavaScript but will be in later PDF versions.

Add the following lines after the for loop to keep the AcroForm:
var form = pdfReader.AcroForm;
if (form != null)
writer.CopyAcroForm(pdfReader);

iTextSharp HTMLWorker ParseHTML Tablestyle and PDFStamper

Hi, I have successfully used an HTMLWorker to convert a GridView using ASP.NET / C#.
(1) I have applied some limited styling to the resulting table, but I cannot see how to apply table styles, for instance grid lines, or other formatting such as a larger width for a particular column.
(2) I would actually like to put this text onto a pre-existing template which contains a logo etc. I've used PdfStamper for this before, but I cannot see how to use both PdfStamper and HTMLWorker at once. HTMLWorker needs a Document which implements IDocListener, but that doesn't seem compatible with using a PdfStamper. I guess what I am looking for is a way to create a PdfStamper, write the title etc., and then add the parsed HTML from the grid.
(3) The other problem is that the parsed content doesn't interact with the other content on the page. For instance, below I add a title chunk to the page. Rather than starting below it, the parsed HTML writes over the top. How do I position the parsed HTML content relative to the rest of what is on the PDF document?
Thanks in advance
Rob
Here's the code I have already:
Document pdfDoc = new Document(PageSize.A4, 10f, 10f, 30f, 0f);
HTMLWorker htmlWorker = new HTMLWorker(pdfDoc);
StyleSheet styles = new StyleSheet();
styles.LoadTagStyle("th", "size", "12px");
styles.LoadTagStyle("th", "face", "helvetica");
styles.LoadTagStyle("span", "size", "10px");
styles.LoadTagStyle("span", "face", "helvetica");
styles.LoadTagStyle("td", "size", "10px");
styles.LoadTagStyle("td", "face", "helvetica");
htmlWorker.SetStyleSheet(styles);
PdfWriter.GetInstance(pdfDoc, HttpContext.Current.Response.OutputStream);
pdfDoc.Open();
//Title - but this gets obscured by the data; it doesn't move the data down
Font font = new Font(Font.FontFamily.HELVETICA, 14, Font.BOLD);
Chunk chunk = new Chunk(title, font);
pdfDoc.Add(chunk);
//Body
htmlWorker.Parse(sr);

Let me first give you a couple of links to look over when you get a chance:
ItextSharp support for HTML and CSS
How to apply font properties on while passing html to pdf using itextsharp
These answers go deeper into what's going on and I recommend reading them when you get a chance. Specifically the second one will show you why you need to use pt instead of px.
To answer your first question let me show you a different way to use the HTMLWorker class. This class has a static method on it called ParseToList that will convert HTML to a List<IElement>. The objects in that list are all iTextSharp specific versions of your HTML. Normally you would do a foreach on those and just add them to a document but you can modify them before adding which is what you want to do. Below is code that takes a static string and does that:
string file1 = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "File1.pdf");
using (FileStream fs = new FileStream(file1, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document doc = new Document(PageSize.LETTER))
{
using (PdfWriter writer = PdfWriter.GetInstance(doc, fs))
{
doc.Open();
//Our HTML
string html = "<table><tr><th>First Name</th><th>Last Name</th></tr><tr><td>Chris</td><td>Haas</td></tr></table>";
//ParseToList requires a TextReader instead of just a string, so wrap the string in a StringReader
using (StringReader sr = new StringReader(html))
{
//Create a style sheet
StyleSheet styles = new StyleSheet();
//...styles omitted for brevity
//Convert our HTML to iTextSharp elements
List<IElement> elements = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(sr, styles);
//Loop through each element (in this case there's actually just one PdfPTable)
foreach (IElement el in elements)
{
//If the element is a PdfPTable
if (el is PdfPTable)
{
//Cast it
PdfPTable tt = (PdfPTable)el;
//Change the widths, these are relative width by the way
tt.SetWidths(new float[] { 75, 25 });
}
//Add the element to the document
doc.Add(el);
}
}
doc.Close();
}
}
}
Hopefully you can see that once you get access to the raw PdfPTable you can tweak it as necessary.
To answer your second question, if you want to use the normal Paragraph and Chunk objects with a PdfStamper then you need to use a PdfContentByte object. You can get this from your stamper in one of two ways: either by asking for one that sits "above" existing content, stamper.GetOverContent(int), or one that sits "below" existing content, stamper.GetUnderContent(int). Both versions take a single parameter saying what page to work with. Once you have a PdfContentByte you can create a ColumnText object bound to it and use this object's AddElement() method to add your normal elements. Before doing this (and this answers your third question), you'll want to create at least one "column". When I do this I generally create one that essentially covers the entire page. (This part might sound weird, but we're essentially making a single-row, single-column table cell to add our objects to.)
Below is a full working C# 2010 WinForms app targeting iTextSharp 5.1.1.0 that shows off everything above. First it creates a generic PDF on the desktop. Then it creates a second document based off of the first, adds a paragraph and then some HTML. See the comments in the code for any questions.
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text;
using iTextSharp.text.html.simpleparser;
using iTextSharp.text.pdf;
using System.IO;
namespace WindowsFormsApplication1
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
//The two files that we are creating
string file1 = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "File1.pdf");
string file2 = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "File2.pdf");
//Create a base file to write on top of
using (FileStream fs = new FileStream(file1, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document doc = new Document(PageSize.LETTER))
{
using (PdfWriter writer = PdfWriter.GetInstance(doc, fs))
{
doc.Open();
doc.Add(new Paragraph("Hello world"));
doc.Close();
}
}
}
//Bind a reader to our first document
PdfReader reader = new PdfReader(file1);
//Create our second document
using (FileStream fs = new FileStream(file2, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (PdfStamper stamper = new PdfStamper(reader, fs))
{
StyleSheet styles = new StyleSheet();
//...styles omitted for brevity
//Our HTML
string html = "<table><tr><th>First Name</th><th>Last Name</th></tr><tr><td>Chris</td><td>Haas</td></tr></table>";
//ParseToList requires a TextReader instead of just a string, so wrap the string in a StringReader
using (StringReader sr = new StringReader(html))
{
//Get our raw PdfContentByte object letting us draw "above" existing content
PdfContentByte cb = stamper.GetOverContent(1);
//Create a new ColumnText object bound to the above PdfContentByte object
ColumnText ct = new ColumnText(cb);
//Get the dimensions of the first page of our source document
iTextSharp.text.Rectangle page1size = reader.GetPageSize(1);
//Create a single column object spanning the entire page
ct.SetSimpleColumn(0, 0, page1size.Width, page1size.Height);
ct.AddElement(new Paragraph("Hello world!"));
//Convert our HTML to iTextSharp elements
List<IElement> elements = iTextSharp.text.html.simpleparser.HTMLWorker.ParseToList(sr, styles);
//Loop through each element (in this case there's actually just one PdfPTable)
foreach (IElement el in elements)
{
//If the element is a PdfPTable
if (el is PdfPTable)
{
//Cast it
PdfPTable tt = (PdfPTable)el;
//Change the widths, these are relative width by the way
tt.SetWidths(new float[] { 75, 25 });
}
//Add the element to the ColumnText
ct.AddElement(el);
}
//IMPORTANT, this actually commits our object to the PDF
ct.Go();
}
}
}
this.Close();
}
}
}

protected void LinkPdf_Click(object sender, EventArgs e)
{
Response.ContentType = "application/pdf";
Response.AddHeader("content-disposition", "attachment;filename=TestPage.pdf");
Response.Cache.SetCacheability(HttpCacheability.NoCache);
StringWriter sw = new StringWriter();
HtmlTextWriter hw = new HtmlTextWriter(sw);
this.Page.RenderControl(hw);
StringReader sr = new StringReader(sw.ToString());
Document pdfDoc = new Document(PageSize.A4, 10f, 10f, 100f, 0f);
HTMLWorker htmlparser = new HTMLWorker(pdfDoc);
PdfWriter.GetInstance(pdfDoc, Response.OutputStream);
pdfDoc.Open();
htmlparser.Parse(sr);
pdfDoc.Close();
//The PDF bytes have already been written to Response.OutputStream, so nothing more is written here
Response.End();
}

Html to pdf some characters are missing (itextsharp)

I want to export a GridView to PDF by using the iTextSharp library. The problem is that some Turkish characters such as İ, ı, Ş, ş etc. are missing in the PDF document. The code used to export the PDF is:
protected void LinkButtonPdf_Click(object sender, EventArgs e)
{
Response.ContentType = "application/pdf";
Response.ContentEncoding = System.Text.Encoding.UTF8;
Response.AddHeader("content-disposition", "attachment;filename=FileName.pdf");
Response.Cache.SetCacheability(HttpCacheability.NoCache);
System.IO.StringWriter stringWrite = new StringWriter();
System.Web.UI.HtmlTextWriter htmlWrite = new HtmlTextWriter(stringWrite);
GridView1.RenderControl(htmlWrite);
StringReader reader = new StringReader(textConvert(stringWrite.ToString()));
Document doc = new Document(PageSize.A4);
HTMLWorker parser = new HTMLWorker(doc);
PdfWriter.GetInstance(doc, Response.OutputStream);
doc.Open();
parser.Parse(reader);
doc.Close();
}
public static string textConvert(string S)
{
if (S == null) { return null; }
try
{
System.Text.Encoding encFrom = System.Text.Encoding.UTF8;
System.Text.Encoding encTo = System.Text.Encoding.UTF8;
string str = S;
Byte[] b = encFrom.GetBytes(str);
return encTo.GetString(b);
}
catch { return null; }
}
Note: when I add the characters to the PDF document directly, they are shown correctly. I add them with this code:
BaseFont bffont = BaseFont.CreateFont("C:\\WINDOWS\\Fonts\\arial.ttf", BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
Font fontozel = new Font(bffont, 12, Font.NORMAL, new Color(0, 0, 0));
doc.Add(new Paragraph("İİııŞŞşşĞĞğğ", fontozel));

Finally, I think I found the solution. I changed the iTextSharp source code a little in order to show Turkish characters (the Turkish code page is Cp1254).
I added public const string CP1254 = "Cp1254"; to BaseFont.cs in the source code.
After that I modified FactoryProperties.cs like this:
public Font GetFont(ChainedProperties props)
{
//I am not reproducing the whole method; I changed only the code below
//------------ Default iTextSharp code ------------
if (encoding == null)
encoding = BaseFont.WINANSI;
return fontImp.GetFont(face, encoding, true, size, style, color);
//------------ Modified code ------------
encoding = BaseFont.CP1254;
return fontImp.GetFont("C:\\WINDOWS\\Fonts\\arial.ttf", encoding, true, size, style, color);
}
After I compiled the new DLL, the missing characters were shown.

No need to change the source code.
Try this:
iTextSharp.text.pdf.BaseFont STF_Helvetica_Turkish = iTextSharp.text.pdf.BaseFont.CreateFont("Helvetica","Cp1254", iTextSharp.text.pdf.BaseFont.NOT_EMBEDDED);
iTextSharp.text.Font fontNormal = new iTextSharp.text.Font(STF_Helvetica_Turkish, 12, iTextSharp.text.Font.NORMAL);

Thank you very much to everyone who posted samples.
I used the solution below from CodeProject; the Turkish character set problems were due to the font.
If you use HTMLWorker, you should register a font and pass it to HTMLWorker:
http://www.codeproject.com/Articles/260470/PDF-reporting-using-ASP-NET-MVC3
StyleSheet styles = new iTextSharp.text.html.simpleparser.StyleSheet();
styles.LoadTagStyle("h3", "size", "5");
styles.LoadTagStyle("td", "size", ".6");
FontFactory.Register("c:\\windows\\fonts\\arial.ttf", "Garamond"); // just give a path of arial.ttf
styles.LoadTagStyle("body", "face", "Garamond");
styles.LoadTagStyle("body", "encoding", "Identity-H");
styles.LoadTagStyle("body", "size", "12pt");
using (var htmlViewReader = new StringReader(htmlText))
{
using (var htmlWorker = new HTMLWorker(pdfDocument, null, styles))
{
htmlWorker.Parse(htmlViewReader);
}
}

I am not familiar with the iTextSharp library; however, you seem to be converting the output of your gridview component to a string and reading from that string to construct your PDF document. You also have a strange conversion from UTF-8 to UTF-8 going on.
From what I can see (given that your GridView is outputting characters correctly) if you are outputting the characters to a string they would be represented as UTF-16 in memory. You probably need to pass this string directly into the PDF library (like how you pass the raw UTF-16 .NET string "İııŞŞşşĞĞğğ" as it is).
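To illustrate the point about the UTF-8 to UTF-8 conversion, here is a minimal sketch that uses only the .NET base library. It shows that encoding a string to UTF-8 bytes and decoding those bytes as UTF-8 returns the same string, so a method shaped like textConvert leaves the text unchanged. The RoundTripDemo class and its method names are made up for this example.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Mirrors the shape of textConvert: encode to UTF-8 bytes, then decode as UTF-8.
    public static string Utf8RoundTrip(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        return Encoding.UTF8.GetString(bytes);
    }

    // True when the round trip leaves the text unchanged.
    public static bool IsUnchanged(string s)
    {
        return string.Equals(s, Utf8RoundTrip(s), StringComparison.Ordinal);
    }
}
```

If RoundTripDemo.IsUnchanged reports true for the rendered GridView markup, the string itself is intact, which would suggest looking at the font and encoding used on the PDF side instead.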

You can use:
iTextSharp.text.pdf.BaseFont Vn_Helvetica = iTextSharp.text.pdf.BaseFont.CreateFont(@"C:\Windows\Fonts\arial.ttf", "Identity-H", iTextSharp.text.pdf.BaseFont.EMBEDDED);
iTextSharp.text.Font fontNormal = new iTextSharp.text.Font(Vn_Helvetica, 12, iTextSharp.text.Font.NORMAL);

For Turkish encoding
CultureInfo ci = new CultureInfo("tr-TR");
Encoding enc = Encoding.GetEncoding(ci.TextInfo.ANSICodePage);
If you're outputting HTML, try different DOCTYPE tags at the top of the page.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Note that if you are using HTML, you may need to HTML-encode the characters:
Server.HtmlEncode()
HttpServerUtility.HtmlEncode()

BaseFont bF = BaseFont.CreateFont("c:\\arial.ttf","windows-1254",true);
Font f = new Font(bF,12f,Font.NORMAL);
Chunk c = new Chunk();
c.Font = f;
c.Append("Turkish characters: ĞÜŞİÖÇ ğüşıöç");
document.Add(c);
In the first line, any of the following encoding names can be used; they all work:
Cp1254
iso-8859-9
windows-1254

Don't change the source code of the iTextSharp. Define a new style:
var styles = new StyleSheet();
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.FONTFAMILY, "tahoma");
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, "Identity-H");
and then pass it to the HTMLWorker.ParseToList method.

I have finally found a solution for this problem; with it you can print all Turkish characters.
String htmlText = html.ToString();
Document document = new Document();
string filePath = HostingEnvironment.MapPath("~/Content/Pdf/");
PdfWriter.GetInstance(document, new FileStream(filePath + "\\pdf-"+Name+".pdf", FileMode.Create));
document.Open();
iTextSharp.text.html.simpleparser.HTMLWorker hw = new iTextSharp.text.html.simpleparser.HTMLWorker(document);
FontFactory.Register(Path.Combine(_webHelper.MapPath("~/App_Data/Pdf/arial.ttf")), "Garamond"); // just give a path of arial.ttf
StyleSheet css = new StyleSheet();
css.LoadTagStyle("body", "face", "Garamond");
css.LoadTagStyle("body", "encoding", "Identity-H");
css.LoadTagStyle("body", "size", "12pt");
hw.SetStyleSheet(css);
hw.Parse(new StringReader(htmlText));

I strongly suggest not to change itextsharp source code in order to solve this problem. Have a look at my other comment on the subject: https://stackoverflow.com/a/24587745/1138663

I solved the problem. Here is another type of solution:
try
{
BaseFont bf = BaseFont.CreateFont("c:\\windows\\fonts\\calibrib.ttf",
BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
Document document = new Document(PageSize.A4, 25, 25, 30, 30);
PdfWriter writer = PdfWriter.GetInstance(document, fs);
Font f = new Font(bf, 12f, Font.NORMAL);
// Open the document to enable you to write to the document
document.Open();
// Add a simple and wellknown phrase to the document
for (int x = 0; x != 100; x++)
{
document.Add(new Paragraph("Paragraph - This is a test! ÇçĞğİıÖöŞşÜü",f));
}
// Close the document
document.Close();
}
catch(Exception)
{
}
