getting text from a pdf document itextsharp

getting text from a pdf document itextsharp - c#

i tried using iTextSharp to get the text from a pdf document,
it works great if the pdf file is with english text(latin chars).
If i try to get the text from a pdf doc with cyrillic characters the output is just question marks. Are there some settings to be made, or cyrillic isnt supported?
this is the code for creating the pdf:
string testText = "зззi";
string tmpFile = #"C:\items\test.pdf";
string myFont = #"C:\windows\fonts\verdana.ttf";
iTextSharp.text.Rectangle pgeSize = new iTextSharp.text.Rectangle(595, 792);
iTextSharp.text.Document doc = new iTextSharp.text.Document(pgeSize, 10, 10, 10, 10);
iTextSharp.text.pdf.PdfWriter wrtr;
wrtr = iTextSharp.text.pdf.PdfWriter.GetInstance(doc,
new System.IO.FileStream(tmpFile, System.IO.FileMode.Create));
doc.Open();
doc.NewPage();
iTextSharp.text.pdf.BaseFont bfR;
bfR = BaseFont.CreateFont(myFont, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
iTextSharp.text.BaseColor clrBlack =
new iTextSharp.text.BaseColor(0, 0, 0);
iTextSharp.text.Font fntHead =
new iTextSharp.text.Font(bfR, 34, iTextSharp.text.Font.NORMAL, clrBlack);
iTextSharp.text.Paragraph pgr =
new iTextSharp.text.Paragraph(testText, fntHead);
doc.Add(pgr);
doc.Close();
this is the code for retrieving the text:
PdfReader reader1
= new PdfReader("c:/items/test.pdf");
Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader1, 1, new SimpleTextExtractionStrategy()));
Console.ReadLine();
the output is: ???i
EDIT 2
i managed to read text from the pdf i created, but still cant get the text from a random pdf. How can i check if that pdf provides the required info for text extraction?

Related

Itextsharp chinese character not shown in PDF

I need so help with my coding. I am trying to convert a HTML with Chinese Character to PDF. I manage to Convert but my chinese characters has disappear
This is my HTML file which i convert to string and i have set font with Arial Unicode MS in the td
string HTMLTemplate = "<table border=0 cellspacing='0' cellpadding='3'><tr><td style='width:100%;font: 10px/1.5em Verdana, Arial Unicode MS, Helvetica, sans-serif;'>GIGI无合约 F&B‎</td></tr></table>"
This is my code
XMLWorkerFontProvider fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
fontProvider.Register("C:\\fonts\\ARIALUNI.TTF");
CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
iTextSharp.text.Document doc = new iTextSharp.text.Document(iTextSharp.text.PageSize.LETTER, 7, 7, 7, 7);
iTextSharp.text.pdf.PdfWriter writer = iTextSharp.text.pdf.PdfWriter.GetInstance(doc, new FileStream(FilePath, FileMode.Create));
iTextSharp.tool.xml.pipeline.html.HtmlPipelineContext htmlContext = new iTextSharp.tool.xml.pipeline.html.HtmlPipelineContext(cssAppliers);
htmlContext.SetTagFactory(iTextSharp.tool.xml.html.Tags.GetHtmlTagProcessorFactory());
//create a cssresolver to apply css
iTextSharp.tool.xml.pipeline.css.ICSSResolver cssResolver = iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().GetDefaultCssResolver(false);
//Create and attach pipline, without pipline parser will not work on css
iTextSharp.tool.xml.IPipeline pipeline = new iTextSharp.tool.xml.pipeline.css.CssResolverPipeline(cssResolver, new iTextSharp.tool.xml.pipeline.html.HtmlPipeline(htmlContext, new iTextSharp.tool.xml.pipeline.end.PdfWriterPipeline(doc, writer)));
//Create XMLWorker and attach a parser to it
iTextSharp.tool.xml.XMLWorker worker = new iTextSharp.tool.xml.XMLWorker(pipeline, true);
iTextSharp.tool.xml.parser.XMLParser xmlParser = new iTextSharp.tool.xml.parser.XMLParser(worker);
//All is well open documnet and start writing.
doc.Open();
xmlParser.Parse(new StringReader(AP_TEMPLATE_HTML));
doc.NewPage();
doc.Close();

Multiple images and texts on same line using iTextSharp

I am trying to create a PDF file on the fly using some backend data and at the bottom of the file I need to include 2 images and a text: a signature image on the left, company logo in the middle and date (underlined with the word "Date" below the line).
I have searched and managed to do two: image on the left and date on the right but now I am stuck trying to get another image in between!
This is what I currently have (assuming I have already read user's name and test date from DB):
using (MemoryStream ms = new MemoryStream())
{
Document doc = new Document(PageSize.LETTER.Rotate());
doc.Open();
// Set path to PDF file and images used in the file
string sPath = Server.MapPath("PDF");
string sImagePath = Server.MapPath("assets/Images");
// Set PDF file name
string sFileName = "SomeFileNameGenerated" + ".PDF";
// Open document
PdfWriter myPDFWriter = PdfWriter.GetInstance(doc, ms);
doc.Open();
// Set some font styles
BaseFont bfTimes = BaseFont.CreateFont(BaseFont.TIMES_ROMAN, BaseFont.CP1252, false);
Font fntDate = new Font(Font.FontFamily.TIMES_ROMAN, 12f, Font.NORMAL | Font.UNDERLINE, BaseColor.BLACK);
Font fntUserName = new Font(Font.FontFamily.TIMES_ROMAN, 24f, Font.BOLD | Font.UNDERLINE, BaseColor.BLACK);
// Add header image
iTextSharp.text.Image imgHeader = iTextSharp.text.Image.GetInstance(sImagePath + "/headerImage.png");
doc.Add(imgHeader);
iTextSharp.text.Image imgSignature = iTextSharp.text.Image.GetInstance(sImagePath + "/Signature.png");
// Add user's name, underlined, in the center
Paragraph pUserName = new Paragraph(sUserName, fntUserName);
pUserName.Alignment = Element.ALIGN_CENTER;
doc.Add(pUserName);
// Here I am trying to add two images and one date text;
// below I have added one image and one text; need to have another image in between
Paragraph para = new Paragraph();
Phrase ph1 = new Phrase();
Chunk glue = new Chunk(new iTextSharp.text.pdf.draw.VerticalPositionMark());
Paragraph main = new Paragraph();
ph1.Add(new Chunk(imgSignature, 0, 0, true)); // Here I add signature image as a chunk into Phrase.
ph1.Add(glue); // Here I add special chunk to the same phrase.
ph1.Add(new Chunk(sTestDate, fntDate)); // Here I add date as a chunk into same phrase.
main.Add(ph1);
para.Add(main);
doc.Add(para);
doc.Close();
}

why courier font not working in iText PDF document?

Using the following code to create a PDF document in C# using iText 5. The text does not render in the courier font. Why not?
private void SimpleFontDoc(string pdfDocPath)
{
Document doc = new Document(PageSize.LETTER, 10, 10, 42, 30);
var fs = new FileStream(pdfDocPath, FileMode.Create);
PdfWriter writer = PdfWriter.GetInstance(doc, fs);
doc.Open();
string[] lines = new string[]
{
"First text line",
"Second text line"
};
var font = FontFactory.GetFont("courier", 12.0f, BaseColor.BLACK);
foreach (var line in lines)
{
var para = new iTextSharp.text.Paragraph(line);
para.Font = font;
doc.Add(para);
}
doc.Close();
}

In iText5 you have to specify the font before adding text to the Paragraph element (or alternatively pass it to the constructor).
Change
var para = new iTextSharp.text.Paragraph(line);
para.Font = font;
into
var para = new iTextSharp.text.Paragraph(line, font);

arabic encoding in itextsharp

When I'm trying this code to create a PDF in Arabic using C#, the generated PDF file contains discrete characters. Any help so I can't get continuous characters?
//Create our document object
Document Doc = new Document(PageSize.LETTER);
//Create our file stream
using (FileStream fs = new FileStream(Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Test.pdf"), FileMode.Create, FileAccess.Write, FileShare.Read))
{
//Bind PDF writer to document and stream
PdfWriter writer = PdfWriter.GetInstance(Doc, fs);
//Open document for writing
Doc.Open();
//Add a page
Doc.NewPage();
//Full path to the Unicode Arial file
string ARIALUNI_TFF = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "arabtype.TTF");
//Create a base font object making sure to specify IDENTITY-H
BaseFont bf = BaseFont.CreateFont(ARIALUNI_TFF, BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
//Create a specific font object
iTextSharp.text.Font f = new iTextSharp.text.Font(bf, 12, iTextSharp.text.Font.NORMAL);
//Write some text, the last character is 0x0278 - LATIN SMALL LETTER PHI
Doc.Add(new Phrase("This is a ميسو ɸ", f));
//Write some more text, the last character is 0x0682 - ARABIC LETTER HAH WITH TWO DOTS VERTICAL ABOVE
Doc.Add(new Phrase("Hello\u0682", f));
//Close the PDF
Doc.Close();
}

Right-to-left writing and Arabic ligatures are only supported in ColumnText and PdfPTable!
Take a look at the following examples:
http://itextpdf.com/examples/iia.php?id=205 with result: say_peace.pdf
http://itextpdf.com/examples/iia.php?id=215 with result: peace.pdf
Read the book from which these examples are taken to find out why not all words are rendered correctly in peace.pdf
Search http://tinyurl.com/itextsharpIIA2C11 for the corresponding C# version of these examples.
In any case, you need a font that knows how to display Arabic glyphs:
BaseFont bf = BaseFont.CreateFont(ARIALUNI_TFF, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
Font f = new Font(bf, 12);
You can now add Arabic text, for instance in a table:
PdfPCell cell = new PdfPCell();
cell.AddElement(new Phrase("Hello\u0682", f));
cell.RunDirection = PdfWriter.RUN_DIRECTION_RTL;

Html to pdf some characters are missing (itextsharp)

I want to export gridview to pdf by using the itextsharp library. The problem is that some turkish characters such as İ,ı,Ş,ş etc... are missing in the pdf document. The code used to export the pdf is:
protected void LinkButtonPdf_Click(object sender, EventArgs e)
{
Response.ContentType = "application/pdf";
Response.ContentEncoding = System.Text.Encoding.UTF8;
Response.AddHeader("content-disposition", "attachment;filename=FileName.pdf");
Response.Cache.SetCacheability(HttpCacheability.NoCache);
System.IO.StringWriter stringWrite = new StringWriter();
System.Web.UI.HtmlTextWriter htmlWrite = new HtmlTextWriter(stringWrite);
GridView1.RenderControl(htmlWrite);
StringReader reader = new StringReader(textConvert(stringWrite.ToString()));
Document doc = new Document(PageSize.A4);
HTMLWorker parser = new HTMLWorker(doc);
PdfWriter.GetInstance(doc, Response.OutputStream);
doc.Open();
parser.Parse(reader);
doc.Close();
}
public static string textConvert(string S)
{
if (S == null) { return null; }
try
{
System.Text.Encoding encFrom = System.Text.Encoding.UTF8;
System.Text.Encoding encTo = System.Text.Encoding.UTF8;
string str = S;
Byte[] b = encFrom.GetBytes(str);
return encTo.GetString(b);
}
catch { return null; }
}
Note: when I want to insert characters into the pdf document, the missing characters are shown in it. I insert the characters with this code:
BaseFont bffont = BaseFont.CreateFont("C:\\WINDOWS\\Fonts\\arial.ttf", BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
Font fontozel = new Font(bffont, 12, Font.NORMAL, new Color(0, 0, 0));
doc.Add(new Paragraph("İİııŞŞşşĞĞğğ", fontozel));

Finaly I think I found the solution,I changed itextsharp source code a little in order to show turkish characters.(turkish character code is cp1254)
I add "public const string CP1254 = "Cp1254";" to [BaseFont.cs] in the source code.
After that I modify the [FactoryProperties.cs].I changed like this;
public Font GetFont(ChainedProperties props)
{
I don't write the whole code.I changed only code below;
------------Default itextsharp code------------------------------------------------------
if (encoding == null)
encoding = BaseFont.WINANSI;
return fontImp.GetFont(face, encoding, true, size, style, color);
-------------modified code--------------------------------------------
encoding = BaseFont.CP1254;
return fontImp.GetFont("C:\\WINDOWS\\Fonts\\arial.ttf", encoding, true, size, style, color);
}
.After I compile new dll ,and missing characters are shown.

No need to change the source code.
Try this:
iTextSharp.text.pdf.BaseFont STF_Helvetica_Turkish = iTextSharp.text.pdf.BaseFont.CreateFont("Helvetica","Cp1254", iTextSharp.text.pdf.BaseFont.NOT_EMBEDDED);
iTextSharp.text.Font fontNormal = new iTextSharp.text.Font(STF_Helvetica_Turkish, 12, iTextSharp.text.Font.NORMAL);

thank you very much all who posted the samples..
i use the below solution from codeproject , and there was the turkish char set problems due to font..
If you use htmlworker you should register font and pass to htmlworker
http://www.codeproject.com/Articles/260470/PDF-reporting-using-ASP-NET-MVC3
StyleSheet styles = new iTextSharp.text.html.simpleparser.StyleSheet();
styles.LoadTagStyle("h3", "size", "5");
styles.LoadTagStyle("td", "size", ".6");
FontFactory.Register("c:\\windows\\fonts\\arial.ttf", "Garamond"); // just give a path of arial.ttf
styles.LoadTagStyle("body", "face", "Garamond");
styles.LoadTagStyle("body", "encoding", "Identity-H");
styles.LoadTagStyle("body", "size", "12pt");
using (var htmlViewReader = new StringReader(htmlText))
{
using (var htmlWorker = new HTMLWorker(pdfDocument, null, styles))
{
htmlWorker.Parse(htmlViewReader);
}
}

I am not familiar with the iTextSharp library; however, you seem to be converting the output of your gridview component to a string and reading from that string to construct your PDF document. You also have a strange conversion from UTF-8 to UTF-8 going on.
From what I can see (given that your GridView is outputting characters correctly) if you are outputting the characters to a string they would be represented as UTF-16 in memory. You probably need to pass this string directly into the PDF library (like how you pass the raw UTF-16 .NET string "İııŞŞşşĞĞğğ" as it is).

You can use:
iTextSharp.text.pdf.BaseFont Vn_Helvetica = iTextSharp.text.pdf.BaseFont.CreateFont(#"C:\Windows\Fonts\arial.ttf", "Identity-H", iTextSharp.text.pdf.BaseFont.EMBEDDED);
iTextSharp.text.Font fontNormal = new iTextSharp.text.Font(Vn_Helvetica, 12, iTextSharp.text.Font.NORMAL);

For Turkish encoding
CultureInfo ci = new CultureInfo("tr-TR");
Encoding enc = Encoding.GetEncoding(ci.TextInfo.ANSICodePage);
If you're outputting HTML, try different DOCTYPE tags at the top of the page.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Note if using HTML you may need to HTMLEncode the characters.
Server.HTMLEncode()
HttpServerUtility.HtmlEncode()

BaseFont bF = BaseFont.CreateFont("c:\\arial.ttf","windows-1254",true);
Font f = new Font(bF,12f,Font.NORMAL);
Chunk c = new Chunk();
c.Font = f;
c.Append("Turkish characters: ĞÜŞİÖÇ ğüşıöç");
document.Add(c);
In the first line, you may write these instead of "windows-1254". All works:
Cp1254
iso-8859-9
windows-1254

Don't change the source code of the iTextSharp. Define a new style:
var styles = new StyleSheet();
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.FONTFAMILY, "tahoma");
styles.LoadTagStyle(HtmlTags.BODY, HtmlTags.ENCODING, "Identity-H");
and then pass it to the HTMLWorker.ParseToList method.

i have finally find a soultution for this problem , by this you can print all turkish character.
String htmlText = html.ToString();
Document document = new Document();
string filePath = HostingEnvironment.MapPath("~/Content/Pdf/");
PdfWriter.GetInstance(document, new FileStream(filePath + "\\pdf-"+Name+".pdf", FileMode.Create));
document.Open();
iTextSharp.text.html.simpleparser.HTMLWorker hw = new iTextSharp.text.html.simpleparser.HTMLWorker(document);
FontFactory.Register(Path.Combine(_webHelper.MapPath("~/App_Data/Pdf/arial.ttf")), "Garamond"); // just give a path of arial.ttf
StyleSheet css = new StyleSheet();
css.LoadTagStyle("body", "face", "Garamond");
css.LoadTagStyle("body", "encoding", "Identity-H");
css.LoadTagStyle("body", "size", "12pt");
hw.SetStyleSheet(css);
hw.Parse(new StringReader(htmlText));

I strongly suggest not to change itextsharp source code in order to solve this problem. Have a look at my other comment on the subject: https://stackoverflow.com/a/24587745/1138663

I solved the problem. I can provide my the other solution type...
try
{
BaseFont bf = BaseFont.CreateFont("c:\\windows\\fonts\\calibrib.ttf",
BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
Document document = new Document(PageSize.A4, 25, 25, 30, 30);
PdfWriter writer = PdfWriter.GetInstance(document, fs);
Font f = new Font(bf, 12f, Font.NORMAL);
// Open the document to enable you to write to the document
document.Open();
// Add a simple and wellknown phrase to the document
for (int x = 0; x != 100; x++)
{
document.Add(new Paragraph("Paragraph - This is a test! ÇçĞğİıÖöŞşÜü",f));
}
// Close the document
document.Close();
}
catch(Exception)
{
}

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

getting text from a pdf document itextsharp - c#

Related

Itextsharp chinese character not shown in PDF

Multiple images and texts on same line using iTextSharp

why courier font not working in iText PDF document?

arabic encoding in itextsharp

Html to pdf some characters are missing (itextsharp)

Categories

Resources