A similar question already exists, but its answer uses iText rather than PDFsharp.
I know how to read and extract the text as a string, but I'm having trouble replacing the text.
My code:
var content = ContentReader.ReadContent(page);
var text = content.ExtractText();
text = text.Replace("Replace This", "With This");
XFont font = new XFont("Times New Roman", 11, XFontStyle.BoldItalic);
gfx.DrawString(text, font, XBrushes.Black, new XRect(0, 0, page.Width, page.Height), XStringFormats.Left);
// Save the document...
const string filename = "New Doc.pdf";
document.Save(filename);
}
public static IEnumerable<string> ExtractText(this CObject cObject)
{
if (cObject is COperator)
{
var cOperator = cObject as COperator;
if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
cOperator.OpCode.Name == OpCodeName.TJ.ToString())
{
foreach (var cOperand in cOperator.Operands)
foreach (var txt in ExtractText(cOperand))
yield return txt;
}
}
else if (cObject is CSequence)
{
var cSequence = cObject as CSequence;
foreach (var element in cSequence)
foreach (var txt in ExtractText(element))
yield return txt;
}
else if (cObject is CString)
{
var cString = cObject as CString;
yield return cString.Value;
}
}
This is sample code, and it ignores graphics and images, so only the text ends up in the output file. Is there a way to replace the text without touching the graphics and images in the content?
The sample seems to be a wrong approach: it returns text only, but ignores graphics, images, and even text positions and text attributes.
You can try to locate the text instructions (TJ, Tj) in the content and replace them with new instructions (also TJ or Tj) without touching anything else in the stream. Such a simple approach would lead to overlapping text or large gaps if the new text has a different length.
PDFsharp was not designed to parse the content streams. You have to write your own code to extract text, you have to write your own code to modify text (or use a third-party library that was built on PDFsharp).
To answer your question: yes, there is a way (as outlined above), but you will have to write a whole lot of code to achieve this (or find suitable code written by a third party).
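As a rough illustration of the "replace only the Tj operands" idea, here is a minimal sketch that works on a content stream that has already been decoded to a plain string. It does not use any PDFsharp API: the class and method names are hypothetical, and it assumes simple literal strings without nested or escaped parentheses and a font encoding in which the bytes correspond to readable characters. TJ arrays, hex strings, and glyph widths are not handled, so the overlap and gap problem mentioned above still applies.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of PDFsharp.
public static class ContentStreamTextReplacer
{
    // Matches a literal string operand followed by the Tj operator, e.g. "(Hello) Tj".
    // Assumption: the literal contains no parentheses or backslash escapes.
    private static readonly Regex ShowText =
        new Regex(@"\((?<text>[^()\\]*)\)(?<gap>\s*)Tj\b", RegexOptions.Compiled);

    // Replaces oldText with newText inside Tj operands only;
    // every other instruction in the stream is left untouched.
    public static string ReplaceInShowTextOperators(string contentStream, string oldText, string newText)
    {
        string escapedNew = EscapeLiteral(newText);
        return ShowText.Replace(contentStream, match =>
        {
            string replaced = match.Groups["text"].Value.Replace(oldText, escapedNew);
            return "(" + replaced + ")" + match.Groups["gap"].Value + "Tj";
        });
    }

    // Escapes the characters that are special inside a PDF literal string.
    private static string EscapeLiteral(string text)
    {
        return text.Replace(@"\", @"\\").Replace("(", @"\(").Replace(")", @"\)");
    }
}
```

For example, calling `ContentStreamTextReplacer.ReplaceInShowTextOperators(stream, "Replace This", "With This")` on a stream containing `(Replace This) Tj` rewrites only that operand; image and path operators such as `Do` or `cm` pass through unchanged. In a real document the strings are often encoded per font, so this sketch is only a starting point.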
Related
I have a problem extracting text from pdf documents using iText7. For documents coming from a specific source textRenderInfo.GetText() returns only garbage chars (0xfdff) in the event handler of my extraction strategy:
internal class CustomExtractionStrategy : ITextExtractionStrategy
{
public virtual void EventOccurred(IEventData data, EventType type)
{
if (!type.Equals((object)EventType.RENDER_TEXT))
{
return;
}
var textRenderInfo = (TextRenderInfo)data;
bool currentResultEmpty = _result.Length == 0;
bool isInNewLine = false;
var baseline = textRenderInfo.GetBaseline();
var startPoint = baseline.GetStartPoint();
var endPoint = baseline.GetEndPoint();
var currentText = textRenderInfo.GetText(); // returns garbage for specific pdfs
// further processing below
...
}
}
I'm not very familiar with the way text/glyph encoding works in PDF, but I'll try to give some details from comparing the problematic PDFs with an example where extraction works. For the PDFs with issues:
textRenderInfo.gs.font is MS-UIGothic
textRenderInfo.gs.font.fontProgram.codeToGlyph contains only mapping (key: 0 to a Glyph with width 1000, unicode -1, code 0)
textRenderInfo.gs.font.fontProgram.unicodeToGlyph contains no records
These are the most obvious discrepancies. If there's anything else I should look out for, please let me know. I would have provided an example of the PDF in question, but it might contain sensitive information that I must not disclose.
Note: the PDFs can be correctly read in Acrobat Reader and I can copy text from the reader into notepad. Other libraries (pdfium based or ports of PDFBox) can properly extract text from the document. So I think the document as such is "valid".
If this is a known issue for iText7, is there any workaround (other than using a different library altogether)?
Update
With the link provided in the comment and the following code (in addition to the custom extraction strategy snippet shown above), I still get garbage characters (see VS screenshot):
internal class PdfExtractor
{
internal void ExtractFromPath(string path)
{
PdfReader reader = new PdfReader(path);
var document = new iText.Kernel.Pdf.PdfDocument(reader);
for (int pageNum = 1; pageNum <= document.GetNumberOfPages(); pageNum++)
{
var page = document.GetPage(pageNum);
string text = PdfTextExtractor.GetTextFromPage(page, new CustomExtractionStrategy());
}
}
}
I have a .NET Core 3 app that reads and splits a PDF containing the paychecks of some companies I work for.
The app ran well until the latest builds, but now the PDF reader fails to parse the contents of any PDF.
The PDF contains only Italian words and no special characters, plus a few tables and a single logo. I can't attach it for privacy reasons.
public PaycheckSplitter Read()
{
using (var reader = new PdfReader(new MemoryStream(this._stream)))
{
var doc = new PdfDocument(reader);
this.Paychecks = new PaychecksCollection();
for (int i = 1; i <= doc.GetNumberOfPages(); i++)
{
PdfPage page = doc.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, new LocationTextExtractionStrategy());
if (text.Contains(Consts.BpEnd)) break;
// trying to find something by regex... btw text contains only a sequence of \n\n\n\n...
string cf = Consts.CodFiscale.Match(text).Value;
this.Paychecks.Add(new Paycheck(cf), i);
}
doc.Close();
}
return this;
}
Is there anything I can do?
As far as I can see, iText7 is the only good free option for reading text from a PDF.
I'm trying to evaluate various libraries for reading PDF files. One of them is iText 7, the 7.0.4 .NET version to be precise. Most files work fine, but for at least one file I tested, iText simply returns gibberish.
This is my code:
private void PdfToText(FileInfo pdfFileInfo, FileInfo textFileInfo)
{
var textFile = new StreamWriter(textFileInfo.FullName);
var pdfDocument = new PdfDocument(new PdfReader(pdfFileInfo.FullName));
var strategy = new LocationTextExtractionStrategy();
for (int i = 1; i <= pdfDocument.GetNumberOfPages(); ++i)
{
var page = pdfDocument.GetPage(i);
string text = PdfTextExtractor.GetTextFromPage(page, strategy);
textFile.Write(text);
}
pdfDocument.Close();
textFile.Close();
}
The resulting file starts with this in hex (and goes on like this as well):
Other libraries can extract the text from this file just fine, and selecting the text in Foxit Reader and copy-pasting it into Notepad++ also gives me readable text.
I'm sorry, but I can't provide the PDF in question, since it contains confidential data.
Any idea how to fix this?
I have the following code, based on this article, which prints a text file from C# to a printer. It prints plain text perfectly, but when I try to print a .docx or .pdf file, the output is some kind of encoded characters instead of the document content. How can I fix this so that PDF and Word files print correctly?
private void btnPrint_Click(object sender, EventArgs e)
{
// Select the desired printer. ps.Duplex = Duplex.Simplex; // This works
pdocFile.PrinterSettings.PrinterName = cboPrinter.Text;
pdocFile.PrinterSettings.Duplex = Duplex.Horizontal;
// Print the checked files.
foreach (string filename in clbFiles.CheckedItems)
{
Console.WriteLine("Printing: " + filename);
// Get the file's name without the path.
FileInfo file_into = new FileInfo(filename);
string short_name = file_into.Name;
// Set the PrintDocument's name for use by the printer queue.
pdocFile.DocumentName = short_name;
// Read the file's contents.
try
{
FileContents = File.ReadAllText(filename).Trim();
}
catch (Exception ex)
{
MessageBox.Show("Error reading file " + filename +
".\n" + ex.Message);
return;
}
// Print.
pdocFile.Print();
}
MessageBox.Show("Spooled " + clbFiles.CheckedItems.Count +
" files for printing.");
}
//
private string FileContents;
// Print a page of the text file.
private void pdocTextFile_PrintPage(object sender, PrintPageEventArgs e)
{
// Make a font for printing.
using (Font font = new Font("Courier New", 10))
{
// Make a StringFormat to align text normally.
using (StringFormat string_format = new StringFormat())
{
// See how much of the remaining text will fit.
SizeF layout_area = new SizeF(e.MarginBounds.Width, e.MarginBounds.Height);
int chars_fitted, lines_filled;
e.Graphics.MeasureString(FileContents, font,
layout_area, string_format,
out chars_fitted, out lines_filled);
// Print as much as will fit.
e.Graphics.DrawString(
FileContents.Substring(0, chars_fitted),
font, Brushes.Black, e.MarginBounds,
string_format);
// Remove the printed text from the string.
FileContents = FileContents.Substring(chars_fitted).Trim();
}
}
// See if we are done.
e.HasMorePages = FileContents.Length > 0;
}
Your example above is taking a binary file format and trying to print it using a method that uses plain text, which will not work. You have a few options on how you could approach this.
Some printers allow you to submit various file types directly to them over a protocol like FTP. An example of this can be seen here. This method works well in enterprise environments that have business printers, but it is limited to the file types supported by each printer and subject to each printer's unique requirements.
For some formats, you can use third-party libraries like iText in your C# code to handle the actual printing. This option gives you a ton of control over the formatting, with the overhead of having to maintain additional code for every file type you wish to support.
You can also use the example code posted here to utilize already installed applications. That example takes advantage of the Print verb made available by Adobe Acrobat, Word, etc. You'll need to make sure the applications are set as the default handlers for their file types and expose the correct verb (which typically corresponds to the context menu entry shown when right-clicking a file). This method is probably the most straightforward option for handling files as-is.
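The third option can be sketched roughly as follows. This is a minimal illustration rather than the linked example: the `ShellPrinter` class and its method names are hypothetical, and it assumes a Windows machine where an installed application has registered a "print" verb for the file type. The verb normally sends the file to the default printer.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper that delegates printing to the application
// registered for the file type (e.g. a PDF viewer for .pdf, Word for .docx).
public static class ShellPrinter
{
    // Builds the start info that asks the Windows shell to run the "print" verb.
    public static ProcessStartInfo BuildPrintStartInfo(string filePath)
    {
        return new ProcessStartInfo
        {
            FileName = filePath,
            Verb = "print",
            UseShellExecute = true,   // verbs are only honored when the shell starts the process
            CreateNoWindow = true,
            WindowStyle = ProcessWindowStyle.Hidden
        };
    }

    // Starts the print job and waits up to 30 seconds for the helper process to exit.
    public static void Print(string filePath)
    {
        using (Process process = Process.Start(BuildPrintStartInfo(filePath)))
        {
            // Process.Start can return null if an already running instance takes over.
            process?.WaitForExit(30000);
        }
    }
}
```

In the question's loop, `ShellPrinter.Print(filename)` would take the place of reading the file with `File.ReadAllText` and calling `pdocFile.Print()` for non-text files. If no application has registered the verb for a given extension, `Process.Start` throws, so the call is worth wrapping in a try/catch.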
Pretty simple one I hope. I have an article of text that I want to display in a window. Now rather than have this massive load of text in the centre of my code, can I add it as a Resource and read it out to the window somehow?
For those asking why, it's simply because it is a massive article and would be very ugly looking stuck in the middle of my code.
UPDATE FOR H.B.
I have tried a number of different approaches to this and am currently looking into the GetManifestResourceStream and using an embeddedResource (txt file) and writing that out to screen. Haven't finished testing it yet but if it works it would be a heck of a lot nicer than copying and pasting the entire text txtbox1.Text = "...blah blah blah".
_textStreamReader = new
StreamReader(Assembly.GetExecutingAssembly().GetManifestResourceStream("Problem.Explaination.txt"));
try
{
if (_textStreamReader.Peek() != -1)
{
txtBlock.Text = _textStreamReader.ReadLine();
}
}
catch
{
MessageBox.Show("Error writing text!");
}
My question remains: is there a better way of achieving this (assuming this even works)?
Thanks
NOTE
In my example above I only want one line of text. If you were adapting this to read a number of lines from a file, you would change it like so:
StreamReader _textStreamReader;
_textStreamReader = new StreamReader(Assembly.GetExecutingAssembly().GetManifestResourceStream("Problem.Explaination.txt"));
var fileContents = _textStreamReader.ReadToEnd();
_textStreamReader.Close();
// Split the resource text on line feeds.
String[] lines = fileContents.Split('\n');
foreach (string line in lines)
{
// Note: the line breaks removed by Split are not re-added here.
txtBlock.Text += line;
}
Add the file as a resource and, in your code, load it into a string.
StringBuilder sb = new StringBuilder();
using (var stream = this.GetType().Assembly.GetManifestResourceStream("MyNamespace.TextFile.txt"))
using(var reader = new StreamReader(stream))
{
string line;
while ((line = reader.ReadLine()) != null)
{
sb.AppendLine(line);
}
}
ViewModel.Text = sb.ToString();
You could place that text in a text file, and read it out in code
http://msdn.microsoft.com/en-us/library/db5x7c0d.aspx
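A minimal sketch of the plain text file approach, assuming the article is shipped as a UTF-8 file next to the executable. The `ArticleLoader` name and the file name in the usage note are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper that loads the whole article from a text file.
public static class ArticleLoader
{
    // Reads the entire file into one string, keeping its line breaks.
    public static string LoadArticle(string path)
    {
        return File.ReadAllText(path, Encoding.UTF8);
    }
}
```

Usage would look like `txtBlock.Text = ArticleLoader.LoadArticle("Explanation.txt");`. Unlike an embedded resource, the file has to be deployed alongside the application, which is the main trade-off between the two suggestions.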