How to extract text from a PDF using C#

Is there a way to get the text that exists inside a border of a specific color, say "red"?
Is it possible to get all the text inside a "red" border box from a PDF using C#? I have googled it, but I did not find any way to get text along with its style formatting from a PDF.

The answer is not simple, unfortunately. Usually, when programmers need to write code that can parse text out of PDF files (what you are trying to do), they use third-party code libraries that other people wrote specifically for manipulating PDFs. In the C# world, there are a few options for well-known PDF manipulation libraries, but the ones that are easiest to use are not free. I've personally had good results using a library called iTextSharp, but it is not free.
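Whichever library you pick, the last step usually looks the same: once the library has given you the rectangle of the red border and the positioned text chunks, you keep the chunks that fall inside the rectangle. The following is only a minimal sketch of that filtering step. The Box, TextChunk and BorderFilter types are hypothetical and do not come from iTextSharp or any other library; finding the red rectangle and the chunk positions is assumed to be done by the PDF library. It uses C# 9 records.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types: a PDF library is assumed to supply the rectangle of the
// red border and the bounds of each text chunk, in the same page coordinates.
public record Box(double Left, double Bottom, double Right, double Top)
{
    // True when 'inner' lies entirely within this box.
    public bool Contains(Box inner) =>
        inner.Left >= Left && inner.Right <= Right &&
        inner.Bottom >= Bottom && inner.Top <= Top;
}

public record TextChunk(string Text, Box Bounds);

public static class BorderFilter
{
    // Returns the text of every chunk whose bounds fall inside 'border',
    // in the order the chunks were supplied.
    public static List<string> TextInside(Box border, IEnumerable<TextChunk> chunks) =>
        chunks.Where(c => border.Contains(c.Bounds))
              .Select(c => c.Text)
              .ToList();
}
```

A chunk that only partly overlaps the border is dropped by this sketch; an overlap test could be swapped in if partial hits should count.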

Related

How can I extract text layout in the correct order from a PDF file using iText in C#

I know how to extract text formatting from a PDF, as explained in "Extract fontname, size, style from pdf with iText".
I also know how to extract text in the right order, as explained in "iText7 reading out lines in a wrong order".
However, it is not at all easy to extract text formatting and keep the correct order at the same time.
In other words, how can I combine two strategies when extracting text in iText?

Try using Docotic.Pdf instead. All of the formatting issues I spent hours working on with no resolution in iText7 were not issues at all when I switched to Docotic.Pdf. No wonky configs or poor documentation. It just works!
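Independent of the library, one way to get both things at once is to record position and font together for every chunk in a single pass, and then sort the chunks yourself. The sketch below shows only the sorting half. StyledChunk and ReadingOrder are hypothetical names, not iText or Docotic.Pdf types, and the 2-point baseline tolerance is an assumption that may need tuning per document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record: a single extraction pass is assumed to fill in both
// the position and the font information for each chunk.
public record StyledChunk(string Text, double X, double Baseline, string Font, double Size);

public static class ReadingOrder
{
    // Groups chunks into lines by baseline (within 'tolerance' points),
    // orders lines top to bottom, and chunks left to right within each line.
    public static List<List<StyledChunk>> ToLines(
        IEnumerable<StyledChunk> chunks, double tolerance = 2.0)
    {
        var lines = new List<List<StyledChunk>>();

        // PDF y coordinates grow upward, so a larger baseline is higher on the page.
        foreach (var chunk in chunks.OrderByDescending(c => c.Baseline))
        {
            var current = lines.Count > 0 ? lines[^1] : null;
            if (current != null &&
                Math.Abs(current[0].Baseline - chunk.Baseline) <= tolerance)
                current.Add(chunk);                          // same line as the previous chunk
            else
                lines.Add(new List<StyledChunk> { chunk });  // start a new line
        }

        // Within each line, read left to right.
        return lines.Select(l => l.OrderBy(c => c.X).ToList()).ToList();
    }
}
```

Each chunk keeps its Font and Size, so the formatting is available in reading order after the sort. Multi-column layouts would need an extra step that this sketch does not attempt.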

Does the MuPDF library have Unicode or text search functionality?

Background
I am working on a WPF Windows application and I want to add an embedded PDF viewer with only basic functionality: PDF viewing, text search and page navigation.
I tried embedding Internet Explorer with the installed Adobe PDF Reader (this way), but that method does not suit our requirements, because Adobe PDF Reader has too many external links, which cannot be allowed for security reasons in this application.
Therefore, I am trying to use the moonpdf library. This library works fine for our requirements, but the only problem is that it has no text search functionality. (I think it shows the PDF as images.)
I then downloaded the moonpdf source code and realized that moonpdf wraps libmupdf.dll for C#.
I can modify the moonpdf source code and the mupdf source code for our requirements if needed.
My Question
Is there any text search functionality in mupdf? If so, how can I use it?

In the basic mupdf library, there are several functions for searching for text. They search a page for a text string, in a few different variants, and return the areas of all hits for the given text. You need to iterate over the pages yourself (in order to do a forward or reverse search).
/* hits receives one quad per match, up to nelem(hits) entries */
fz_quad hits[1000];
int count = fz_search_page(ctx, page, needle, hits, nelem(hits));
That said, I do not know how or even if "moonpdf" has wrapped these functions.

You can certainly extract the text from a document, the MuPDF library will do that. I believe it's up to you to apply your own search criteria after that. I'm afraid I'm not expert enough to answer the 'how to' part of it though. I imagine one of the mutool examples would be helpful here though. I'll see if I can get one of the developers to answer.

How to generate an RTF document server side in c#

I have tried using the System.Windows.Documents.FlowDocument server side, but ran into a problem with images.
What I need to produce is a document with headings, section breaks, page breaks, images (with text wrapping around from the left or the right), tables and ideally some kind of table of contents.
I use C# and ASP.NET.
Is there a library that will do most of this?
RTF has been chosen because the document needs to be openable in older versions of Word, be editable, and we can't run Word on the server.
Thank you

I used MigraDoc in the past, it is a free library. You can create PDFs or RTFs. Just Google it.

I have started using .net rtf writer.
It produces clean rtf, but doesn't do everything I need.
There is pretty good documentation for rtf here.
I am working some things out for myself. For example, I needed to be able to wrap text around an image. Whilst the rtf writer above enables you to add images to documents, it does so by putting the image in its own paragraph. What I need is a shape element.
In the rtf it ends up looking something like this (some of the numbers define the size and position of the image in twips):
{\shp{\*\shpinst\shpleft3801\shptop1\shpright8300\shpbottom4500\shpfhdr0\shpbxcolumn\shpbxignore\shpbypara\shpbyignore\shpwr2\shpwrk0\shpfblwtxt0\shpz0
{\sp
{\sn pib}
{\sv
{\pict\pngblip\pichgoal4499\picwgoal4499
-- image binary data goes here --
}}}
{\sp
{\sn fLine}
{\sv 0}}}}
I sometimes just save something in Word and try to understand what it did (but Word seems to add a lot of noise).
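As a rough illustration, the shape group above can be generated with plain string building. This is only a sketch: RtfShape and WrappedPng are hypothetical names and are not part of the .net rtf writer library. It reuses the control words from the fragment above unchanged, takes positions and sizes in twips (1440 per inch), and writes the image bytes as hex digits, which is how RTF stores picture data unless \bin is used.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any RTF library.
public static class RtfShape
{
    public const int TwipsPerInch = 1440;

    // Builds a \shp group that anchors a PNG at (left, top) with the given
    // width and height, all in twips, using the same control words as the
    // hand-written fragment above.
    public static string WrappedPng(byte[] png, int left, int top, int width, int height)
    {
        var sb = new StringBuilder();
        sb.Append(@"{\shp{\*\shpinst");
        sb.Append($@"\shpleft{left}\shptop{top}\shpright{left + width}\shpbottom{top + height}");
        sb.Append(@"\shpfhdr0\shpbxcolumn\shpbxignore\shpbypara\shpbyignore\shpwr2\shpwrk0\shpfblwtxt0\shpz0");
        sb.Append(@"{\sp{\sn pib}{\sv {\pict\pngblip");
        sb.Append($@"\pichgoal{height}\picwgoal{width} ");

        // Picture data is written as two hex digits per byte.
        foreach (byte b in png)
            sb.Append(b.ToString("x2"));

        sb.Append(@"}}}{\sp{\sn fLine}{\sv 0}}}}");
        return sb.ToString();
    }
}
```

Calling RtfShape.WrappedPng(bytes, 3801, 1, 4499, 4499) reproduces the position values of the fragment above (right edge 8300, bottom edge 4500). The returned string still has to be placed inside a paragraph of a complete RTF document.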

How do I explore a PDF to determine if an element is text?

I have a PDF and want to extract the text contained in it. I've tried a few different PDF libraries and they all return basically the same results. When extracting the text from a two page document with literally hundreds of words, only a dozen or so words from the header are returned.
Is there any way to tell if the text I'm after is actually text or a raster image of the text? I'm thinking something along the lines of Firebug's "Inspect Element" but at this point I'll take any solution that tells what I'm really looking at.
This project really doesn't justify attempting to use OCR. And, although a simple solution, using fields in the PDF is not an option since the generator of the file is a third party.

If Acrobat/Reader can select the text, then it Is Text.
Reasons your library might not be able to find the text in question:
Complex/bad fonts or encodings. Adobe can be very forgiving of garbage in, somehow managing to get Good Info out.
The text could be in an annotation rather than the page contents. It won't matter what program parses the content stream if you need to look in the annot array instead.
You didn't name a particular library, so it's possible that the library you're using doesn't look inside XObject Forms. That's unlikely in an even remotely mature API, but stranger things have happened.
If you can get away with copy/pasta from Reader, then just go that route.

Have you tried Amyuni PDF Creator .Net? It allows you to enumerate all components from a specified rectangular region of a page and inspect their type from a predefined types list. You could run a quick test using the trial version and the following code sample for text extraction:
// open a PDF file
axPDFCreactiveX1.Open(System.IO.Directory.GetCurrentDirectory()+"\\sampleBookmarks.pdf", "");
axPDFCreactiveX1.Refresh ();
String text = axPDFCreactiveX1.GetRawPageText (1);
MessageBox.Show (text);
Additionally, it provides Tesseract OCR integration in case you needed it.
Disclaimer: I am part of the development team of this product.

Check this site out. It may contain some helpful code snippets. http://www.codeproject.com/KB/cs/PDFToText.aspx

Parsing Office Documents

I'd like to be able to read the content of Office documents (for a custom crawler).
The Office versions that need to be readable range from 2000 to 2007. I mainly want to crawl Word, Excel and PowerPoint documents.
I don't want to retrieve the formatting, only the text.
The crawler is based on Lucene.NET, if that is of any help, and is written in C#.
I already use iTextSharp for parsing PDFs.

If you're already using Lucene.NET you might just want to take advantage of the various IFilters already available for doing this. Take a look at the open source SeekAFile project. It will show you how to use an IFilter to open and extract this information from any file type where an IFilter is available. There are IFilters for Word, Excel, PowerPoint, PDF, and most of the other common document types.

There is an excellent open source project, POI; the only drawback is that it is written for Java.
The .NET port is still very much in beta.

Here is a good list of various tools for converting Word documents to plaintext, which you can then do whatever with.

Here's a nice little post on c-sharpcorner by Krishnan LN that gives basic code to grab the text from a Word document using the Word Primary Interop Assemblies.
Basically, you get the "WholeStory" property out of the Word document, paste it to the clipboard, then pull it from the clipboard while converting it to text format. The clipboard step is presumably done to strip out formatting.
For PowerPoint, you do a similar thing, but you need to loop through the slides, then for each slide loop through the shapes, and grab the "TextFrame.TextRange.Text" property in each shape.
For Excel, since Excel can be an OleDb data source, it's easiest to use ADO.NET. Here's a good post by Laurent Bugnion that walks through this technique.

You might also consider checking out DtSearch (www.DtSearch.com). Although it is primarily a searching tool, it does a great job of extracting text from a large number of file types and is considerably cheaper than other options like the Oracle/Stellent OutsideIn technology or the equivalent from Autonomy.
I've been using DtSearch for years and find it indispensable for this type of task.
