I am trying to parse a PDF file that has two columns of text on most pages and no images. I tried the iTextSharp solution from "how can i get text formatting with iTextSharp". It seemed to work, but then I noticed some rather serious issues: the text is returned out of order in some places. I simply want the text parsed in the same order in which it appears on each page (no special order), but this is not happening. Is there a version of the TextWithFontExtractionStrategy solution for iText 7 that does not exhibit this problem (or even a version of iTextSharp that works correctly, for that matter)? I would appreciate any assistance.
Related
I know how to extract text formatting from a PDF, as explained in "Extract fontname, size, style from pdf with iText".
I even know how to extract text in the right order, as explained in "iText7 reading out lines in a wrong order".
However, it is not easy at all to extract text formatting in the correct order.
In other words, how can I use two strategies when I am extracting text in iText?
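One way to think about combining the two strategies: collect every text chunk together with its font information and position, then sort the chunks yourself. The following is only a library-agnostic sketch in Python, not iText code. The `Chunk` type, `reading_order` function, and the "split the page down the middle" column rule are all assumptions made for illustration; a real document may need a smarter column detector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical record of one text chunk captured during extraction."""
    text: str
    font: str
    size: float
    x: float  # left edge of the chunk
    y: float  # baseline in PDF coordinates (larger y is higher on the page)


def reading_order(chunks: List[Chunk], page_width: float,
                  line_tolerance: float = 2.0) -> List[Chunk]:
    """Sort chunks for a two-column page: left column first, then right.

    Assumption: the columns are separated at the horizontal middle of the page.
    Baselines are bucketed by line_tolerance so that chunks on the same line
    with slightly different y values are ordered left to right. Bucketing by
    rounding is crude and can split a line that straddles a bucket boundary.
    """
    middle = page_width / 2

    def sort_key(chunk: Chunk):
        column = 0 if chunk.x < middle else 1
        line_bucket = -round(chunk.y / line_tolerance)  # top of page first
        return (column, line_bucket, chunk.x)

    return sorted(chunks, key=sort_key)


if __name__ == "__main__":
    sample = [
        Chunk("right-top", "Arial", 10, 320, 700),
        Chunk("left-bottom", "Times-Bold", 12, 50, 100),
        Chunk("left-top-b", "Arial", 10, 120, 700),
        Chunk("left-top-a", "Arial", 10, 50, 700.4),
    ]
    for chunk in reading_order(sample, page_width=600):
        print(f"{chunk.font} {chunk.size}: {chunk.text}")
```

Because the font name and size travel with each chunk, the formatting survives the re-ordering, which is the part that a plain position-based text strategy throws away.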
Try using Docotic.Pdf instead. All of the formatting issues I spent hours working on with no resolution in iText7 were not issues at all when I switched to Docotic.Pdf. No wonky configs or poor documentation. It just works!
I had this stupid idea of creating a template as a .docx or .rtf or .pdf and then replacing the text in that document to generate reports. This seemed like a better way of doing it than using paid reporting software.
Well, I believe I've tried just about everything now and I'm amazed at how impossible it is to do anything with pdfs.
Try 1
HTML -> PDF
A lot harder to design the template. It doesn't look the same when you print it. Never got it working outside of a command line example (not sure how well, say, iTextSharp-LGPL would even work or if it could handle base64 strings as I'm not sure how else you are going to tell it about images). In any case, doing it this way makes it too hard to design the template.
Try 2
OpenXml -> PDF
I stupidly assumed that because Word could save as PDF, OpenXml could too. I was wrong. It cannot save as a PDF.
Try 3
OpenOffice/LibreOffice (docX -> PDF)
It can't read OpenXml which is a problem because I was editing the template as OpenXml and then saving that result (as a .docx) but it can't read that saved document.
Try 4
iTextSharp LGPL
This one just doesn't work, lol. Even though iText and its derivatives are the ONLY thing that comes up when you google "convert rtf to pdf", it doesn't convert RTF documents to PDF documents. I verified this myself (it only saves the text, not the formatting) and later found this post, which convinced me I wasn't doing something wrong.
Try 5
PDF -> PDF
Since converting ANYTHING to a PDF seems to be impossible maybe I can save the template as a PDF and just do a text replace on that. Nope, lol, that is apparently a very difficult thing to do.
Try 6
Pandoc (.odt/.docx -> pdf), (.rtf -> .pdf not supported)
pandoc mockup2.odt -s -o mockup2.pdf
link to the files in the picture. *note, it messes up in the same way if you try converting .odt/.docx to .tex.
What do I do here? Buy software so that I can save a file as PDF? Is that the only option?
I have a solution. I'm not saying it's the best solution. LibreOffice (or possibly OpenOffice if you are so inclined) accepts command line arguments that will do the conversion.
soffice.exe --headless --convert-to pdf mockup.odt
*note - this is after I added libreoffice to my path (C:\Program Files\LibreOffice\program). idk why it's called soffice.exe instead of libreoffice.exe.
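If you want to call that from a program rather than by hand, something like the following Python sketch could wrap the same command. The function names are made up for illustration. It assumes `soffice` is on the PATH, and it adds the `--outdir` option so the output folder is explicit; the expected output name (same base name with a .pdf extension) is also an assumption about how LibreOffice names the result.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(source: str, out_dir: str,
                          soffice: str = "soffice") -> List[str]:
    """Build the headless LibreOffice command line for a PDF conversion."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def expected_output(source: str, out_dir: str) -> Path:
    """Assumed output path: same base name as the source, .pdf extension."""
    return Path(out_dir) / (Path(source).stem + ".pdf")


def convert_to_pdf(source: str, out_dir: str,
                   soffice: str = "soffice", timeout: int = 120) -> Path:
    """Run the conversion and return the path where the PDF should be.

    Raises subprocess.CalledProcessError if soffice exits with an error,
    and subprocess.TimeoutExpired if it hangs past the timeout.
    """
    subprocess.run(build_convert_command(source, out_dir, soffice),
                   check=True, timeout=timeout)
    return expected_output(source, out_dir)


if __name__ == "__main__":
    print(" ".join(build_convert_command("mockup.odt", "out")))
```

Keeping the command construction separate from the call makes it easy to log or test the exact command line without LibreOffice installed.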
where i found the answer
relevant documentation
I might have a working solution for you, if you are stuck with the docx-file for the template.
I found one free solution for docx to pdf conversions, without using microsoft.interop, etc.: See first answer in this stack overflow post
It uses two tools: the Open XML PowerTools and DinkToPdf (which is essentially a wkhtmltopdf wrapper). The HTML to PDF part works just fine, but the docx to HTML part looks like a catastrophe at first. You can fix this with custom CSS (there are some resources online).
Powertools-.NetStandard
DinkToPdf-GitHub
There are more possibilities with proprietary software, like Aspose.Words and Syncfusion file-formats. Most of the proprietary solutions are pretty expensive...
If you are working only in a Windows environment where MS Office is installed, you can use Microsoft.Interop. It is by far the easiest solution (Interop is mentioned several times in this post: Stackoverflow Word to PDF).
If you found another (better) working solution, please let me know. I still have not decided if I will use a proprietary or a free solution. :-)
When I render Arabic text on a report, the text does not render correctly. It appears to be rendered one individual letter at a time, rather than being joined up.
The text is being displayed right to left correctly (I've used the dir=rtl formatting on each element I'm adding), which is confusing me.
Any help anyone can give is appreciated.
I've added a screenshot of some text as an example.
So I emailed abcpdf directly and they told me this:
ABCpdf 8 supports Arabic, but does not support contextual ligatures with the Doc.AddHtml approach - only with regular HTML/CSS (i.e. using Doc.AddImageUrl or Doc.AddImageHtml).
Support for contextual ligatures with Doc.AddHtml was added in ABCpdf 9.1 and is present in the current live release, ABCpdf 10.
Further clarification:
If I add an HTML file with the specific Arabic text onto my server, I should be able to access it and render the text in that file correctly?
That's correct. Please ensure you have the final ABCpdf 8 minor version (8123) from our Downloads page. And you may need to use the Gecko HTML engine - please see the HtmlOptions.Engine property.
I have a problem I'm not sure can be solved, and I wondered if I could ask for help on here.
I am creating an invoice with PDFsharp in C#. I have written the first page, displaying strings taken from datagrids along the way. That's all fine.
However, I'm coming to the summary page and I need to loop through the datagridview and output its values into the PDF. Is that possible within PDFsharp, or do I have to go to MigraDoc to do this?
If so, is MigraDoc still supported? I ask because I cannot get the references into my solution.
Any help would be appreciated.
Thanks.
MigraDoc is still supported.
MigraDoc uses PDFsharp to create PDF files, so everything that can be done with MigraDoc can also be done with PDFsharp only.
MigraDoc supports tables with borders and handles page breaks automatically, so for invoices it is most likely a good idea to use MigraDoc and not PDFsharp.
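To give an idea of what "handles page breaks automatically" saves you: if you draw the rows yourself with PDFsharp only, you have to track the vertical position and start a new page whenever the next row would not fit. The following Python sketch shows just that bookkeeping, not PDFsharp or MigraDoc code. The function name, the default A4-like page height in points, the margins, and the fixed row height are all assumptions for illustration, with y measured downward from the top of the page.

```python
from typing import List, Tuple


def layout_rows(row_count: int, page_height: float = 842.0,
                top: float = 50.0, bottom: float = 50.0,
                row_height: float = 20.0) -> List[Tuple[int, float]]:
    """Return (page_index, y) for each row, breaking to a new page as needed.

    Assumes every row has the same height and that a single row always
    fits on an empty page.
    """
    positions = []
    page, y = 0, top
    for _ in range(row_count):
        # Start a new page if this row would run into the bottom margin.
        if y + row_height > page_height - bottom:
            page += 1
            y = top
        positions.append((page, y))
        y += row_height
    return positions


if __name__ == "__main__":
    for page, y in layout_rows(5, page_height=150.0):
        print(f"page {page}, y = {y}")
```

With variable row heights, repeated table headers, or cell borders this gets tedious quickly, which is why a layout library is usually the easier route for invoices.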
See also:
http://www.pdfsharp.net/wiki/Invoice-sample.ashx
With respect to adding the references, detailed error messages would be helpful.
The source code package of MigraDoc includes samples with Visual Studio solutions and all references.
I need your expertise in fixing a problem I have been facing for a week. This has already turned into a 'royal pain in the lower back side' category and time is running out fast.
Problem
I have developed a C# script that I call from ColdFusion to convert Word documents to PDF. The script does the conversion properly, but the (justified) text in the paragraphs is not spaced properly. I get a non-selectable space next to some characters.
See the image -
What it should look like...
What it looks like...
The red marks are added to show the spaces created.
Now, if I open the file in Word manually and save it, I do not get this problem. What am I missing or doing wrong that results in this error?
Details of my application flow -
I create a DOC (based on my design needs) and save it as HTML.
This HTML will be used by my CF application to manipulate the content based on some placeholders and the final output is again saved as HTML.
The xx.html file is renamed to xx.doc and passed to my C# based converter, which does the doc to pdf conversion via Word Automation.
I ponder in joy seeing my well formed PDF output, but get sad that the text is a bit messy.
I have tried this with multiple fonts, and what I observe is that it only happens with certain fonts (in my case it's Palatino Linotype). I want to know what the difference is between doing it manually and via automation. Is there a setting (like a boolean) that needs to be set for this, or some other hack?
My system configuration -
Windows 2008 R2 64b + .NET 4 + Office 2010
Note: I know that Office automation is bad, but right now it is the only option I have to get my job done.
I found a work-around for this. It seems to be dependent on the selected printer!
First go to the print dialog (File / Print) and select "Microsoft XPS Document Writer" instead of your normal printer. You don't need to print anything.
Now export the PDF (File / Export / Create PDF)
Selecting other printer drivers may also work. I found this solution in this thread: http://www.howtofixcomputers.com/forums/microsoft-office/bad-kerning-pdf-using-save-pdf-xps-add-244886.html
Notes:
I also installed Adobe PDF Writer before finding this. It's possible that affected it.
My system is Windows 8.1 & Office 2013 running under Fusion 5.0.3 on a Mac mini.
I guess the trouble could be the font used. Please try:
change the font
ensure that the language of the text (LanguageID property) is correct
Or a special character could have been inserted, for example a wrongly interpreted "no-width optional break". Try selecting the text, cutting and pasting it in Word, and showing non-printable characters; it should then be visible.