I use this code to read pdf content using iTextSharp. it works fine when content is english but it doesn't work whene content is Persian or Arabic Result is something like this :
Here is sample non-English PDF for test.
َٛنا Ùٔب٘طث یؿیٛ٘ زؾا ÙÙ›ÙØÙ” Ù‚Ù›Ù…Ø
یٔبٕس © Karl Seguin foppersian.codeplex.com
www.codebetter.com 1 1 Ùٔب٘طث َٛنا یؿیٛ٘
همانرب لوصا یسیون مرن دیلوت رتهب رازÙا
What is the solution ?
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
text.Append(currentText);
pdfReader.Close();
}
}
return text.ToString();
}
In .Net, once you have a string, you have a string, and it is Unicode, always. The actual in-memory implementation is UTF-16 but that doesn't matter. Never, ever, ever decompose the string into bytes and try to reinterpret it as a different encoding and slap it back as a string because that doesn't make sense and will almost always fail.
Your problem is this line:
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
I'm going to pull it apart into a couple of lines to illustrate:
byte[] bytes = Encoding.UTF8.GetBytes("ی"); //bytes now holds 0xDB8C
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes);//converted now holds 0xC39BC592
string final = Encoding.UTF8.GetString(converted);//final now holds ی
The code will mix up anything above the 127 ASCII barrier. Drop the re-encoding line and you should be good.
Side-note, it is totally possible that whatever creates a string does it incorrectly, that's not too uncommon actually. But you need to fix that problem before it becomes a string, at the byte level.
EDIT
The code should be the exact same as yours above except that one line should be removed. Also, whatever you're using to display the text in, make sure that it supports Unicode. Also, as #kuujinbo said, make sure that you're using a recent version of iTextSharp. I tested this with 5.2.0.0.
public string ReadPdfFile(string fileName) {
StringBuilder text = new StringBuilder();
if (File.Exists(fileName)) {
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
EDIT 2
The above code fixes the encoding issue but doesn't fix the order of the strings themselves. Unfortunately this problem appears to be at the PDF level itself.
Consequently, showing text in such right-to-left writing systems
requires either positioning each glyph individually (which is tedious
and costly) or representing text with show strings (see 9.2,
“Organization and Use of Fonts”) whose character codes are given in
reverse order.
PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings
When re-ordering strings such as above, content is (if I understand the spec correctly) supposed to use a "marked content" section, BMC. However, the few sample PDFs that I've looked at and generated don't appear to actually do this. I absolutely could be wrong on this part because this is very much not my specialty so you'll have to poke around so more.
Related
i have a netcore 3 app to read and split a PDF containing paychecks of some companies which i am working for.
This app ran pretty well since last builds... my the way, the PDF reader started to fail to parse the contents of any PDF.
PDF is built only with Italian words, no special chars. Few tables and a single logo. I'm not able to attach it due to privacy.
public PaycheckSplitter Read()
{
using (var reader = new PdfReader(new MemoryStream(this._stream)))
{
var doc = new PdfDocument(reader);
this.Paycheck = new PaychecksCollection();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
PdfPage page = doc.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
if (text.Contains(Consts.BpEnd)) break;
// trying to find something by regex... btw text contains only a sequence of \n\n\n\n...
string cf = Consts.CodFiscale.Match(text).Value;
this.Paychecks.Add(new Paycheck(cf), i);
}
doc.Close();
}
return this;
}
Anything i can do?
As far as i can see... the only and best way to have something to read a PDF text for free is iText7...
I'm trying to evaluate various libraries to read PDF files. One of them is iText 7, the 7.0.4 .NET version to be precise. Some files work fine, but there is at least one file i tested it with, where iText simply returns gibberish (most files work just fine).
This is my code:
private void PdfToText(FileInfo pdfFileInfo, FileInfo textFileInfo)
{
var textFile = new StreamWriter(textFileInfo.FullName);
var pdfDocument = new PdfDocument(new PdfReader(pdfFileInfo.FullName));
var strategy = new LocationTextExtractionStrategy();
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
var page = pdfDocument.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, strategy);
textFile.Write(text);
}
pdfDocument.Close();
textFile.Close();
}
The resulting file starts with this in hex (and goes on like this as well):
Other libraries can extract the text from this file just fine, selecting the text with Foxit Reader and then use copy paste to Notepad++ also gives me readable text.
I'm sorry, but i can't provide the PDF in question, since it contains confidential data.
Any idea how to fix this?
I am generating a PDF from an HTML string.
When this string is really long, I would like to create a new page, split the text (without breaking the html) and so on.
Here is my code :
// instantiate Pdf object
Aspose.Pdf.Generator.Pdf pdf = new Aspose.Pdf.Generator.Pdf();
// specify the Character encoding for for HTML file
pdf.HtmlInfo.CharSet = "UTF-8";
pdf.HtmlInfo.Margin.Left = 10;
pdf.HtmlInfo.Margin.Right = 10;
pdf.HtmlInfo.PageHeight = 1050;
pdf.HtmlInfo.PageWidth = 730;
pdf.HtmlInfo.ShowUnknownHtmlTagsAsText = true;
pdf.HtmlInfo.TryEnlargePredefinedTableColumnWidthsToAvoidWordBreaking = true;
pdf.HtmlInfo.CharsetApplyingLevelOfForce = Aspose.Pdf.Generator.HtmlInfo.CharsetApplyingForceLevel.UseWhenImpossibleDetectFromContent;
// bind the source HTML
pdf.BindHTML("MyVeryVeryLongHTML");
MemoryStream stream = new MemoryStream();
pdf.Save(stream);
byte[] pdfBytes = stream.ToArray();
This code works for the HTML, but the overflow is not handled. The text continue after the page. Is it possible to set a max "height" of the page to not cross, and if it does, it recreates a new page ?
Hope it makes sense !
Thanks a lot
You can set the Page height by selecting type of PDF page you require like A1, A2, etc . Afterwords , your problem of page height will automatically be taken care by the Aspose. For more refer the link..
Aspose PDF Page Height
Update
update pdf.HtmlInfo to pdf.PageSetup (or pdf.PageInfo) and add bottom margin also.
I have a problem to read and display content of some PDFs into RichTextBox.
I use the following code:
string fileName = #"C:\Users\PC\Desktop\SomePdf.pdf";
string str = string.Empty;
PdfReader reader = new PdfReader(fileName);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
str = str + s;
rtbVsebina.Text = str;
}
reader.Close();
Some PDFs can be read and displayed into RichTextBox and some they can not be. For those that can not be read I only get empty RichTextBox but with some added lines as I would press Key 'Enter' on the keyboard a couple of times.
Does anybody know what could be wrong?
You are confusing page content with page annotations.
Page content is part of the content stream of a page. It's referred to in the /Contents entry of the page dictionary and (optionally) in external objects (aka XObjects). With the code snippet you have copy/pasted in your question, you are extracting this content.
A rich text box is one of the many types of annotations. Annotations are not part of the content stream of a page. They are referred to from the /Annots entry of the page dictionary. If you want to get the contents of an annotation, you need to ask the page for its annotations instead of parsing the content of the page. See for instance Reading PDF Annotations with iText.
In answer to your question "What am I doing wrong": you were looking at the wrong place.
I am using itextsharp to extract text from a pdf document using this code:
public static bool does_document_text_have_keyword(string keyword,
string pdf_src, Report report_object) // TEST
{
try
{
PdfReader pdfReader = new PdfReader(pdf_src);
string currentText;
int count = pdfReader.NumberOfPages;
for (int page = 1; page <= count; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
currentText = PdfTextExtractor.GetTextFromPage
(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString
(ASCIIEncoding.Convert
(Encoding.Default,
Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
report_object.log(currentText); // TEST
if (currentText.IndexOf
(keyword, StringComparison.OrdinalIgnoreCase) != -1) return true;
}
pdfReader.Close();
return false;
}
catch
{
return false;
}
}
But the problem is, when I extract text, the text has no white spaces, it's as if the white spaces has been replaced with an empty string. Yet in the pdf document, there are white spaces in it. Does anyone know whats happening here?
I believe your issue is the SimpleTextExtractionStrategy. From the API documentation at http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/SimpleTextExtractionStrategy.html
If the PDF renders text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.
Try using the LocationTextExtractionStrategy. It's documentation states:
A text extraction renderer that keeps track of relative position of text on page The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.