How to extract text from a PDF and decode characters?

I am using itextsharp to extract text from a pdf document using this code:
public static bool does_document_text_have_keyword(string keyword,
    string pdf_src, Report report_object) // TEST
{
    try
    {
        PdfReader pdfReader = new PdfReader(pdf_src);
        string currentText;
        int count = pdfReader.NumberOfPages;
        for (int page = 1; page <= count; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(
                ASCIIEncoding.Convert(
                    Encoding.Default,
                    Encoding.UTF8,
                    Encoding.Default.GetBytes(currentText)));
            report_object.log(currentText); // TEST
            if (currentText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) != -1)
                return true;
        }
        pdfReader.Close();
        return false;
    }
    catch
    {
        return false;
    }
}
The problem is that the extracted text has no white space; it is as if every space had been replaced with an empty string, even though the PDF document clearly contains spaces. Does anyone know what's happening here?

I believe your issue is the SimpleTextExtractionStrategy. From the API documentation at http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/SimpleTextExtractionStrategy.html
If the PDF renders text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.
Try using the LocationTextExtractionStrategy. Its documentation states:
A text extraction renderer that keeps track of relative position of text on page. The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.
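For illustration, here is a minimal sketch of the keyword search with the strategy swapped, assuming iTextSharp 5.x. The class and method names (KeywordSearch, DocumentContainsKeyword) are made up for this example, and the re-encoding step and logging from the question are left out. It is not guaranteed to restore spaces for every PDF.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class KeywordSearch
{
    // Hypothetical helper: returns true if any page contains the keyword.
    public static bool DocumentContainsKeyword(string pdfPath, string keyword)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page; it places text chunks by position
                // instead of relying on the order in the content stream.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                // The returned string is already Unicode, so no re-encoding is needed.
                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                    return true;
            }
            return false;
        }
        finally
        {
            // Close the reader on every path, including the early return.
            reader.Close();
        }
    }
}
```

Unlike the original method, this version lets exceptions propagate instead of swallowing them, which makes a broken or unreadable PDF easier to tell apart from a document without the keyword.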

Related

Text extraction using itext7: garbage characters for some pdf documents

I have a problem extracting text from pdf documents using iText7. For documents coming from a specific source textRenderInfo.GetText() returns only garbage chars (0xfdff) in the event handler of my extraction strategy:
internal class CustomExtractionStrategy : ITextExtractionStrategy
{
    public virtual void EventOccurred(IEventData data, EventType type)
    {
        if (!type.Equals((object)EventType.RENDER_TEXT))
        {
            return;
        }
        var textRenderInfo = (TextRenderInfo)data;
        bool currentResultEmpty = _result.Length == 0;
        bool isInNewLine = false;
        var baseline = textRenderInfo.GetBaseline();
        var startPoint = baseline.GetStartPoint();
        var endPoint = baseline.GetEndPoint();
        var currentText = textRenderInfo.GetText(); // returns garbage for specific pdfs
        // further processing below
        ...
    }
}
I'm not very familiar with the way text/glyph encoding works in PDF, but I'll try to give some details from comparing the problematic PDFs with an example where extraction works. For the PDFs with issues:
textRenderInfo.gs.font is MS-UIGothic
textRenderInfo.gs.font.fontProgram.codeToGlyph contains only mapping (key: 0 to a Glyph with width 1000, unicode -1, code 0)
textRenderInfo.gs.font.fontProgram.unicodeToGlyph contains no records
These are the most obvious discrepancies. If there's anything else I should look out for, please let me know. I would have provided an example of the PDF in question, but it might contain sensitive information that I must not disclose.
Note: the PDFs can be correctly read in Acrobat Reader and I can copy text from the reader into notepad. Other libraries (pdfium based or ports of PDFBox) can properly extract text from the document. So I think the document as such is "valid".
If this is a known issue for iText7, is there any workaround (other than using a different library altogether)?
Update
With the link provided in the comment and the following code (in addition to the custom extraction strategy snippet shown above) I still get garbage chars:
internal class PdfExtractor
{
    internal void ExtractFromPath(string path)
    {
        PdfReader reader = new PdfReader(path);
        var document = new iText.Kernel.Pdf.PdfDocument(reader);
        for (int pageNum = 1; pageNum <= document.GetNumberOfPages(); pageNum++)
        {
            var page = document.GetPage(pageNum);
            string text = PdfTextExtractor.GetTextFromPage(page, new CustomExtractionStrategy());
        }
    }
}

Reading PDF in net core with itext7 returns "\n\n\n\n\n...."

I have a .NET Core 3 app that reads and splits a PDF containing paychecks for some companies I work for.
The app ran well until recent builds; now the PDF reader fails to parse the contents of any PDF.
The PDF contains only Italian words, no special characters, plus a few tables and a single logo. I'm not able to attach it for privacy reasons.
public PaycheckSplitter Read()
{
    using (var reader = new PdfReader(new MemoryStream(this._stream)))
    {
        var doc = new PdfDocument(reader);
        this.Paychecks = new PaychecksCollection();
        for (int i = 1; i <= doc.GetNumberOfPages(); i++)
        {
            PdfPage page = doc.GetPage(i);
            string text = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
            if (text.Contains(Consts.BpEnd)) break;
            // trying to find something by regex... btw text contains only a sequence of \n\n\n\n...
            string cf = Consts.CodFiscale.Match(text).Value;
            this.Paychecks.Add(new Paycheck(cf), i);
        }
        doc.Close();
    }
    return this;
}
Is there anything I can do?
As far as I can see, the only good free option for reading PDF text is iText7.

PDFsharp: Replace a string using PDFsharp

This question has been asked before, but the existing answers use iTextPDF rather than PDFsharp.
Coming back to the question: I know a way to read and extract the string, but I'm having trouble replacing the text.
My Code:
var content = ContentReader.ReadContent(page);
var text = content.ExtractText();
text = text.Replace("Replace This", "With This");
XFont font = new XFont("Times New Roman", 11, XFontStyle.BoldItalic);
gfx.DrawString(text, font, XBrushes.Black, new XRect(0, 0, page.Width, page.Height), XStringFormats.Left);
// Save the document...
const string filename = "New Doc.pdf";
document.Save(filename);
}
public static IEnumerable<string> ExtractText(this CObject cObject)
{
    if (cObject is COperator)
    {
        var cOperator = cObject as COperator;
        if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
            cOperator.OpCode.Name == OpCodeName.TJ.ToString())
        {
            foreach (var cOperand in cOperator.Operands)
                foreach (var txt in ExtractText(cOperand))
                    yield return txt;
        }
    }
    else if (cObject is CSequence)
    {
        var cSequence = cObject as CSequence;
        foreach (var element in cSequence)
            foreach (var txt in ExtractText(element))
                yield return txt;
    }
    else if (cObject is CString)
    {
        var cString = cObject as CString;
        yield return cString.Value;
    }
}
This is sample code; it ignores graphics and images and ends up writing only text to the output file. Is there a way I can replace the text without touching the graphics and images in the content?

The sample seems to be a wrong approach: it returns text only, but ignores graphics, images, and even text positions and text attributes.
You can try to locate the text instructions (TJ, Tj) in the content and replace them with new instructions (also TJ or Tj) without touching anything else in the stream. Such a simple approach would lead to overlapping text or large gaps if the new text has a different length.
PDFsharp was not designed to parse the content streams. You have to write your own code to extract text, you have to write your own code to modify text (or use a third-party library that was built on PDFsharp).
To answer your question: yes, there is a way (as outlined above), but you will have to write a whole lot of code to achieve this (or find suitable code written by a third party).

iTextSharp can't read some PDF files

I have a problem reading the content of some PDFs and displaying it in a RichTextBox.
I use the following code:
string fileName = @"C:\Users\PC\Desktop\SomePdf.pdf";
string str = string.Empty;
PdfReader reader = new PdfReader(fileName);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    String s = PdfTextExtractor.GetTextFromPage(reader, i, its);
    s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
    str = str + s;
    rtbVsebina.Text = str;
}
reader.Close();
Some PDFs can be read and displayed in the RichTextBox and some cannot. For those that cannot be read, I get an empty RichTextBox containing only a few blank lines, as if I had pressed the Enter key a couple of times.
Does anybody know what could be wrong?

You are confusing page content with page annotations.
Page content is part of the content stream of a page. It's referred to in the /Contents entry of the page dictionary and (optionally) in external objects (aka XObjects). With the code snippet you have copy/pasted in your question, you are extracting this content.
A rich text box is one of the many types of annotations. Annotations are not part of the content stream of a page. They are referred to from the /Annots entry of the page dictionary. If you want to get the contents of an annotation, you need to ask the page for its annotations instead of parsing the content of the page. See for instance Reading PDF Annotations with iText.
In answer to your question "What am I doing wrong": you were looking at the wrong place.
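As a rough sketch of what "asking the page for its annotations" can look like, assuming iTextSharp 5.x: the code below walks the /Annots array of each page and collects the /Contents string of every annotation that has one. The class and method names (AnnotationReader, ReadAnnotationContents) are invented for this example. Rich text annotations may keep their formatted text in other entries, which this sketch does not read.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class AnnotationReader
{
    // Hypothetical helper: returns the /Contents text of all annotations in the file.
    public static List<string> ReadAnnotationContents(string pdfPath)
    {
        var result = new List<string>();
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // The page dictionary holds /Annots next to /Contents.
                PdfDictionary page = reader.GetPageN(i);
                PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                if (annots == null) continue; // page has no annotations

                for (int j = 0; j < annots.Size; j++)
                {
                    PdfDictionary annot = annots.GetAsDict(j);
                    if (annot == null) continue;

                    // /Contents is the plain-text description of the annotation.
                    PdfString contents = annot.GetAsString(PdfName.CONTENTS);
                    if (contents != null)
                        result.Add(contents.ToUnicodeString());
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return result;
    }
}
```

If this list comes back empty as well, the text is probably not stored in annotations, and the extraction problem lies in the page content itself.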

Reading pdf content using iTextSharp in C#

I use this code to read PDF content using iTextSharp. It works fine when the content is English, but it doesn't work when the content is Persian or Arabic. The result is something like this:
Here is sample non-English PDF for test.
ÙŽÙ›Ù†Ø§ ÙÙ”Ø¨Ù˜Ø·Ø« ÛŒØ¿ÛŒÙ›Ù˜ Ø²Ø¾Ø§ ÙÙ›ÙØÙ” Ù‚Ù›Ù…Ø
ÛŒÙ”Ø¨Ù•Ø³ Â© Karl Seguin foppersian.codeplex.com
www.codebetter.com 1 1 ÙÙ”Ø¨Ù˜Ø·Ø« ÙŽÙ›Ù†Ø§ ÛŒØ¿ÛŒÙ›Ù˜
Ù‡Ù…Ø§Ù†Ø±Ø¨ Ù„ÙˆØµØ§ ÛŒØ³ÛŒÙˆÙ† Ù…Ø±Ù† Ø¯ÛŒÙ„ÙˆØª Ø±ØªÙ‡Ø¨ Ø±Ø§Ø²ÙØ§
What is the solution ?
public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();
    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
            text.Append(currentText);
            pdfReader.Close();
        }
    }
    return text.ToString();
}

In .NET, once you have a string, you have a string, and it is Unicode, always. The actual in-memory implementation is UTF-16, but that doesn't matter. Never, ever, ever decompose the string into bytes, reinterpret those bytes as a different encoding, and slap the result back into a string; that doesn't make sense and will almost always fail.
Your problem is this line:
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
I'm going to pull it apart into a couple of lines to illustrate:
byte[] bytes = Encoding.UTF8.GetBytes("ی"); //bytes now holds 0xDB8C
byte[] converted = Encoding.Convert(Encoding.Default, Encoding.UTF8, bytes);//converted now holds 0xC39BC592
string final = Encoding.UTF8.GetString(converted);//final now holds ÛŒ
The code will mix up anything above the 127 ASCII barrier. Drop the re-encoding line and you should be good.
Side-note, it is totally possible that whatever creates a string does it incorrectly, that's not too uncommon actually. But you need to fix that problem before it becomes a string, at the byte level.
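To make both points concrete, here is a small self-contained sketch that needs no PDF library. It assumes ISO-8859-1 (Latin-1) as a stand-in for whatever legacy code page Encoding.Default returns on a given machine, so the exact garbage characters may differ from the Windows-1252 example above. The class and method names (EncodingDemo, Mangle, DecodeOnce) are made up for this illustration.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Assumption: Latin-1 stands in for the machine's legacy default code page.
    private static readonly Encoding Legacy = Encoding.GetEncoding("ISO-8859-1");

    // Reproduces the harmful round trip: the UTF-8 bytes of a correct string
    // are misread as single-byte legacy text and then re-encoded.
    public static string Mangle(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        byte[] converted = Encoding.Convert(Legacy, Encoding.UTF8, utf8);
        return Encoding.UTF8.GetString(converted);
    }

    // The byte-level fix: decode raw bytes exactly once, using the encoding
    // they were actually written in.
    public static string DecodeOnce(byte[] rawBytes, Encoding actualEncoding)
    {
        return actualEncoding.GetString(rawBytes);
    }

    public static void Main()
    {
        string original = "\u06CC"; // ARABIC LETTER FARSI YEH
        string mangled = Mangle(original);

        // Each UTF-8 byte of the original became its own character.
        Console.WriteLine(mangled.Length);
        Console.WriteLine(mangled == original);

        // Decoding the raw bytes once with the right encoding is lossless.
        byte[] raw = Encoding.UTF8.GetBytes(original);
        Console.WriteLine(DecodeOnce(raw, Encoding.UTF8) == original);
    }
}
```

The takeaway is that Mangle never has a legitimate use on a string that is already correct; an encoding should only be chosen at the point where bytes first become a string.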
EDIT
The code should be exactly the same as yours above, except that the one line should be removed. Also, whatever you're using to display the text, make sure that it supports Unicode. Also, as @kuujinbo said, make sure that you're using a recent version of iTextSharp. I tested this with 5.2.0.0.
public string ReadPdfFile(string fileName) {
    StringBuilder text = new StringBuilder();
    if (File.Exists(fileName)) {
        PdfReader pdfReader = new PdfReader(fileName);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++) {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}
EDIT 2
The above code fixes the encoding issue but doesn't fix the order of the strings themselves. Unfortunately this problem appears to be at the PDF level itself.
Consequently, showing text in such right-to-left writing systems
requires either positioning each glyph individually (which is tedious
and costly) or representing text with show strings (see 9.2,
“Organization and Use of Fonts”) whose character codes are given in
reverse order.
PDF 2008 Spec - 14.8.2.3.3 - Reverse-Order Show Strings
When re-ordering strings as described above, content is (if I understand the spec correctly) supposed to use a "marked content" section, BMC. However, the few sample PDFs that I've looked at and generated don't appear to actually do this. I absolutely could be wrong on this part because this is very much not my specialty, so you'll have to poke around some more.
