how to convert pdf file to text file using c#.net

how to convert pdf file to text file using c#.net - c#

currently i have been using the following code and i am using some dll files from pdfbox
FileInfo file = new FileInfo("c://aa.pdf");
PDDocument doc = PDDocument.load(file.FullName);
PDFTextStripper pdfStripper = new PDFTextStripper();
string text = pdfStripper.getText (doc);
richTextBox1.Text = qq;
using this code i can able to get text file but not in a correct format plz give me a some ideas

Extracting the text from a pdf file is anything but trivial.
To quote from th iTextSharp tutorial.
"The pdf format is just a canvas where
text and graphics are placed without
any structure information. As such
there aren't any 'iText-objects' in a
PDF file. In each page there will
probably be a number of 'Strings', but
you can't reconstruct a phrase or a
paragraph using these strings. There
are probably a number of lines drawn,
but you can't retrieve a Table-object
based on these lines. In short:
parsing the content of a PDF-file is
NOT POSSIBLE with iText."
There are several commercial applications which claim to be able to do it. Caveat Emptor.
There is also a free software library called Poppler http://poppler.freedesktop.org/ which is used by the pdf viewers of GNOME and KDE. It has a function called pdftotext() but I have no experience with it. It may be your best free option.

There is a blog article explaining the issues with PDF text extraction in general at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text

Related

Copy entire content from one Word document to another using C#

I am trying to copy entire content (including page numbers, and page layout) from a word document to another, using Microsoft.Office.Interop.Word.
I cannot use SaveAs method because the document in which I want to paste the contents is already created and it contains VBA code.
Also, I cannot use XML related code because the document in which I am copying the content is in the older format. This document is part of an old way of uploading a document to a server database, using VBA code.
Using VBA code, I can copy the entire content without any issue.
Selection.WholeStory
Selection.Copy
Windows("document.doc").Activate
Selection.WholeStory
Selection.PasteAndFormat (wdFormatOriginalFormatting)
For C#, I used Microsoft.Office.Interop.Word to replicate the VBA code.
Word.Application objWordOpen = new Word.Application();
objWordOpen.Visible = false;
Word.Document doclocal = objWordOpen.Documents.Open(filepath);
doclocal.ActiveWindow.Selection.WholeStory();
doclocal.ActiveWindow.Selection.Copy();
Document d1 = objWordOpen.Documents.Open(filepath2);
d1.Activate();
d1.ActiveWindow.Selection.WholeStory();
d1.ActiveWindow.Selection.PasteAndFormat(Word.WdRecoveryType.wdFormatOriginalFormatting);
I have also tried using range
Word.Range oRange = doclocal.Content;
oRange.Copy();
The content is copied into the document, but without headers and footers. Also, when using Selection.WholeStory() approach, the page margins settings don't get copied.
What changes should I make to the c# code in order to achieve my result?

MS Office applications have complicated relationships with the clipboard. Between various optimisations that may lead to cryptic prompts, and numerous formats they support, it is best to not do anything remotely funny between a Copy and a Paste.
The VBA code follows this advice, the C# code opens a document between copying and pasting.
Make sure you open the documents in advance and not in the middle of a copypaste.

How to read pdf file to a text file in a proper format using Spire.PDF or any other library?

How can I read pdf files and save contents to a text file using Spire.PDF?
For example: Here is a pdf file and here is the desired text file from that pdf
I tried the below code to read the file and save it to a text file
PdfDocument doc = new PdfDocument();
doc.LoadFromFile(#"C:\Users\Tamal\Desktop\101395a.pdf");
StringBuilder buffer = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
buffer.Append(page.ExtractText());
}
doc.Close();
String fileName = #"C:\Users\Tamal\Desktop\101395a.txt";
File.WriteAllText(fileName, buffer.ToString());
System.Diagnostics.Process.Start(fileName);
But the output text file is not properly formatted. It has unnecessary whitespaces and a complete para is broken into multiple lines etc.
How do I get the desired result as in the desired text file?
Additionally, it is possible to detect and mark(like add a tag) to texts with bold, italic or underline forms as well? Also things get more problematic for pages have multiple columns of text.

Using iText
File inputFile = new File("input.pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
SimpleTextExtractionStrategy stes = new SimpleTextExtractionStrategy();
PdfCanvasProcessor canvasProcessor = new PdfCanvasProcessor(stes);
canvasProcessor.processPageContent(pdfDocument.getPage(1));
System.out.println(stes.getResultantText());
This is (as the code says) a basic/simple text extraction strategy.
More advanced examples can be found in the documentation.

Use IronOCR
var Ocr = new IronOcr.AutoOcr();
var Results = Ocr.ReadPdf("E:\Demo.pdf");
File.WriteAllText("E:\Demo.txt", Convert.ToString(Results));
For reference https://ironsoftware.com/csharp/ocr/
Using this you should get formatted text output, but not exact desire output which you want.
If you want exact pre-interpreted output, then you should check paid OCR services like OmniPage capture SDK & Abbyy finereader SDK

That is the nature of PDF. It basically says "go to this location on a page and place this character there." I'm not familiar at all with Spire.PFF; I work with Java and the PDFBox library, but any attempt to extract text from PDF is heuristic and hence imperfect. This is a problem that has received considerable attention and some applications have better results than others, so you may want to survey all available options. Still, I think you'll have to clean up the result.

Reading images from Word files using C# Word API without using Clipboard

I've been working on an application to read images from multiple word files and store them in one single word file using Microsoft.Office.Interop.Word in C#
EDIT: I also need to save a copy of the images on the file system, so I need the image in a Bitmap or similar object.
This is my implementation so far, which works fine:
foreach (InlineShape shape in doc.InlineShapes)
{
shape.Range.Select();
if (shape.Type == WdInlineShapeType.wdInlineShapePicture)
{
doc.ActiveWindow.Selection.Range.CopyAsPicture();
ImageData = Clipboard.GetDataObject();
object _ob1 = ImageData.GetData(DataFormats.Bitmap);
bmp = (Bitmap)_ob1;
images[i++] = bmp;
/*
bmp.Save("C:\\Users\\Akshay\\Pictures\\bitmaps\\test" + i.ToString() + ".bmp");
*/
}
}
I have:
Selected the images as InlineShapes
Copied the shape into Clipboard
Stored the shape in the Clipboard in a DataObject
Extracted the shape from the DataObject in Bitmap format and stored in a Bitmap object.
I've been told to refrain from using Clipboard in Word automation and use the Word APIs instead.
I've read up on it and found an SO answer stating the same.
I looked up many implementations of reading images from Word files on MSDN, SO etc. but could not find any without using clipboard.
How do I read images from Word files using the Word APIs from Microsoft.Office.Interop.Word namespace alone without using Clipboard ?

Word documents in the Office Open XML file format store images in Base64. So it should be possible to extract that information and convert/stream it to a file. You can access the information when the document is open in the Word application using the Range.WordOpenXML property.
string shapeBase64 = shape.Range.WordOpenXML;
This will return the entire Word Open XML in the flat file OPC format. In other words, it won't contain only the picture in Base64, but the entire zip package definition as XML that surrounds it. In my quick test, the tag the contains the actual Base64 is
<pkg:binaryData>
That's a child element of
<pkg:part pkg:name="/word/media/image1.jpg" pkg:contentType="image/jpeg" pkg:compression="store">
Note that it would also be possible for you to get the entire document's WordOpenXML in one step:
document.Content.WordOpenXML
but might then need to understand the way the InlineShapes in the document body are linked to the actual information in the "media" part.
And it would be possible, of course, to work directly with the Zip Package (using the Open XML SDK, perhaps) instead of opening the document in the Word.Application.

Convert PDF to PDF/A3 or PDF/A-1 to PDF/A-3

I'm testing iTextSharp to generate ZUGFeRD-Files. My first step was to generate a ZUGFeRD conform file from an existing PDF/A-3 file. This was successfull by using PDFACopy and creating the necessary PDFFileSpecification.
The next step would be to generate a PDF/A-3 file from an existing PDF or PDF/A-1 file and this is the hard part.
First, when I'm trying to use PDFACopy in combination with a regular PDF (not PDF/A) im getting an error that PDFACopy can only be used with PDF/A-conform files. My first question is, how to get an PDF/A-3-conform file from a PDF with iTextSharp?
To reduce the gap, I decided to convert the PDF into PDF/A-1 file with ghostscript (cf. How to use ghostscript to convert PDF to PDF/A or PDF/X?).
This was succesfull and I tried again. Then the error "Different PDF/A version." was thrown. It seems that I can't copy from existing PDF/A-1 into a new PDF/A-3. How can I create this PDF/A-3 from an existing PDF(/A-1)? Is that even possible?
Here is my code:
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(XML);
byte[] xmlBytes = Encoding.Default.GetBytes(xmlDoc.OuterXml);
Document doc = new Document();
PdfReader src_reader = new PdfReader(pdfPath);
FileStream fs = new FileStream(DEST, FileMode.Create, FileAccess.ReadWrite);
PdfACopy aCopy = new PdfACopy(doc, fs, PdfAConformanceLevel.ZUGFeRD);
doc.AddLanguage("de-DE");
doc.AddTitle("title");
doc.SetPageSize(src_reader.GetPageSizeWithRotation(1));
aCopy.SetTagged();
aCopy.UserProperties = true;
aCopy.PdfVersion = PdfCopy.VERSION_1_7;
aCopy.ViewerPreferences = PdfCopy.DisplayDocTitle;
aCopy.CreateXmpMetadata();
aCopy.XmpWriter.SetProperty(PdfAXmpWriter.zugferdSchemaNS, PdfAXmpWriter.zugferdDocumentFileName, "ZUGFeRD-invoice.xml");
//Ab hier können keine Metadaten mehr geschrieben werden
doc.Open();
ICC_Profile icc = ICC_Profile.GetInstance(new FileStream(ICM, FileMode.Open));
aCopy.SetOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
[...add the dictionary to doc..]
aCopy.AddDocument(src_reader);
doc.Close();
One more question: addDocument works, but when I'm using copy.addPage(copy.getImportedPage(src_reader, i)), an error "the document has no pages" will be thrown. WHY?

1. Can you convert a regular PDF to a PDF/A document?
The answer is: it depends.
PDF/A is a subset of PDF and involves some obligations (e.g. all fonts must be embedded) and restrictions (e.g. no Javascript is allowed). iText can't "automatically" convert a regular PDF to a PDF/A for a number of reasons. For instance: if a font is not embedded, iText doesn't know which font to use to replace the unembedded font, nor where to find the necessary font program. Usually this requires human interaction because replacing one font by an arbitrary other font usually results in very ugly PDFs.
The answer is: it depends because some people are using iText to convert PDF to PDF/A, but this involves a lot of programming and human decisions. I see that you succeed when using GhostScript. In that case, GhostScript is making some decisions in your place. This can lead to acceptable results. In some cases, the result will not be acceptable (e.g. very odd-looking PDFs if the fonts don't match).
2. Can you convert a PDF/A-1 file to a PDF/A-3 file?
The PDF/A standard is written in such a way that old versions of the PDF/A specification are never outdated. Newer versions only add newer functionality. For instance: PDF/A-1 was based on the PDF 1.4 specification. Optional Content functionality (OCG) was introduced in PDF 1.5. The introduction of OCG is one of the differences between PDF/A-2 and PDF/A-1.
This means that every file that conforms to PDF/A-1 automatically conforms to PDF/A-2. However, a PDF/A-2 file could contain functionality that isn't supported in PDF/A-1.
3. What is the difference between PDF/A-2 and PDF/A-3?
PDF/A-2 and PDF/A-3 are identical, except for one difference: a PDF/A-3 file can have attachments that aren't PDF/A files. For instance: a PDF/A-3 file can have a Word file as attachment, an XLS file, a plain text file,... You mention ZUGFeRD: in that case, the PDF/A-3 file has at least an XML file as attachment.
Summarized:
This is a broad answer to a broad question (your question goes in many different directions, so it's hard to give you a specific answer). Why don't you use the already built-in ZUGFeRD support to create the invoices? Read ZUGFeRD, the future of invoicing for more info.

Build Word Document from template

I have a request to create a word document on the fly based on a template provided to me. I have done some research and everything seems to point at OpenXML. I have looked into that, but the cs file that gets created is over 15k lines and is breaking my VS 2010 (causing it to not respond every time I make a change).
I have been looking at this tutorial series on Open XML
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2011/10/13/getting-started-with-open-xml-development.aspx
I have done things in the past with text files and Regular Expressions, but since Word encrypts everything, that does not work. Are there any other options that are fairly lightweight for creating word documents from templates.

//Hi, It is quite simple.
//First, you should copy your Template file into another location.
string SourcePath = "C:\\MyTemplate.dotx";
string DestPath = "C:\\MyDocument.docx";
System.IO.File.Copy(SourcePath, DestPath);
//After copying the file, you can open a WordprocessingDocument using your Destination Path.
WordprocessingDocument Mydoc = WordprocessingDocument.Open(DestPath, true);
//After openning your document, you can change type of your document after adding additional parts into your document.
mydoc.ChangeDocumentType(WordprocessingDocumentType.Document);
//If you wish, you can edit your document
AttachedTemplate attachedTemplate1 = new AttachedTemplate() { Id = "MyRelationID" };
MainDocumentPart mainPart = mydoc.MainDocumentPart;
MySettingsPart = mainPart.DocumentSettingsPart;
MySettingsPart.Settings.Append(attachedTemplate1);
MySettingsPart.AddExternalRelationship("http://schemas.openxmlformats.org/officeDocument/2006/relationships/attachedTemplate", new Uri(CopyPath, UriKind.Absolute), "MyRelationID");
//Finally you can save your document.
mainPart.Document.Save();

I am currently working on something along these lines and I have been making use of the Open XML SDK and the OpenXmlPowerTools The approach been taken is taking the actual template file opening it up and putting text into various place holders within the template document. I have been using content controls as the place markers.
The SDK tool to open up a document has been invaluable in being able to compare documents and see how it is constructed. However the code generated from the tool I have been refactoring heavily and removing sections that are not being used at all.
I can't talk about doc files but with docx files they are not encrypted they are just zip files that contain xml files
Eric White's blog has a large number of examples and code samples which have been very useful

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.