How to programmatically convert a pdf file to a text file? - c#

I need to load a pdf file and then convert it to a text file programmatically in order to then parse it.
Another possibility would be to execute the file (execute Adobe Reader, with the pdf file as the argument) and then "send keys" to it to save the file as text.
However, I would prefer not to do it that way (opening the file) but will if that is the only solution. But: is it even possible to do a "send keys" sort of functionality in C#/WPF?
Note: I don't want to buy any custom components, and besides, I'm using Visual Studio 2012 RC in this "home" project, so I don't know if 3rd party components would be compatible anyway.

If you are looking to deploy this application to other users, I would tend to lean towards using one of the many PDF libraries available and process the PDF via code vs. attempting to use Adobe Reader. It will eliminate issues if your users don't have Adobe Reader installed.
Try starting at the link below for some library ideas.
https://stackoverflow.com/questions/373926/lightweight-open-source-pdf-library-in-c
C# PDF Control & Library

Related

The best way to extract text from common documents formats (primarily rtf, doc, docx, pdf, epub, mobi) that works with UWP?

I'd like to implement support for these types of files in my application, but for this I need something that will let me extract raw text from these filetypes.
I'm looking for either a solution that doesn't require any additional libraries, or an all-in-one library/NuGet package. I took a look at GemBox.Document but it doesn't seem to be working with UWP projects.
What would be the best option for this?
"I'm looking for either a solution that doesn't require any additional libraries, or an all-in-one library/NuGet package."
There is no such package.
In a standard UWP app we can read an .rtf file with the RichEditBox control; there is a code sample in this document that shows how to edit, load, and save a Rich Text Format (.rtf) file in a RichEditBox.
For .doc and .docx (MS Word) documents, especially versions after 2007, the usual route is the Open-XML-SDK, and it currently doesn't support the UWP platform.
For .pdf documents, you can refer to @Franklin Chen's thread: [UWP]PDF Viewing on a Windows Universal App.
An epub file is a ZIP archive. To parse it, you can refer to the thread: [WP8.1][C#] How can i read an EPub file in c# on Windows Phone!?.
For mobi files, sorry, I couldn't find any useful development information for the moment. For now I can only suggest converting them to PDF with a free online service.
In short, since the Open-XML-SDK currently doesn't support the UWP platform, there is no single solution or package that covers all of these formats in a standard UWP app. You can try to find a web service that does this and call it from your app, or you can use a commercial library that can read documents in all these formats.
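For the epub case, here is a minimal sketch of what "it is a ZIP archive" means in practice, using only System.IO.Compression. The class and method names are made up for illustration, the tag stripping is deliberately crude (it also keeps titles and any inline script or style text), and entries are read in archive order rather than the reading order defined in the book's OPF file, so treat it as a starting point rather than a complete EPUB reader.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class EpubTextSketch
{
    // Returns the visible text of every (X)HTML entry in the EPUB,
    // one line per entry, in archive order (not spine order).
    public static string ExtractText(Stream epubStream)
    {
        var result = new StringBuilder();
        using (var archive = new ZipArchive(epubStream, ZipArchiveMode.Read, true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                string name = entry.FullName.ToLowerInvariant();
                bool isMarkup = name.EndsWith(".xhtml")
                             || name.EndsWith(".html")
                             || name.EndsWith(".htm");
                if (!isMarkup)
                    continue;

                using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
                {
                    string markup = reader.ReadToEnd();
                    // Crude tag removal: replace every <...> with a space.
                    string text = Regex.Replace(markup, "<[^>]+>", " ");
                    // Turn entities such as &amp; back into characters.
                    text = WebUtility.HtmlDecode(text);
                    // Collapse runs of whitespace.
                    text = Regex.Replace(text, @"\s+", " ").Trim();
                    if (text.Length > 0)
                        result.AppendLine(text);
                }
            }
        }
        return result.ToString();
    }
}
```

You would call it with any readable, seekable stream for the .epub file, for example a FileStream on the desktop or the stream obtained from a StorageFile in a UWP app.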

What is the standard way for dealing with PowerPoint (.PPTX) files on the server?

I've been tasked with a feature that can generate PowerPoint files on the server using C#. I'd basically start with a template, and programmatically replace some text with live data from the database. I've been doing some research into this area for the past day and here's what I've found:
PowerPoint has this sort of thing built in, meaning it can connect to external data sources and pull in data. Most examples of this, I've found, either use PowerPoint automation done on the server (I've been advised against this) or assume a SQL Server backend. Our company uses Oracle for our RDBMS needs. Oracle has a solution for this called Oracle BI, but it requires a whole new web server setup to run various Java EE components and whatnot. I didn't look at the price, but knowing Oracle it's not cheap. It also requires new software to be installed on the end user's machine, which we really want to avoid.
Generating PowerPoint files on the fly is possible. The company that is basically the go-to guys for this problem (every help forum points to them, and they get all the rave reviews) is Aspose. They have .NET components for dealing with just about any Office format you can think of. The problem is, they are astronomically expensive. Just the PowerPoint component (a site license for up to 10 developers) would cost $3,995.
The third possibility is generating a solution in-house. After all, a PPTX file is just XML, right? Well, looking closer, a PPTX appears to be a ZIP archive. It contains many folders, each containing many XML files. Modifying a PPTX file would, correct me if I'm wrong, entail unzipping the file to a temporary directory, reading the XML file and modifying the contents, then packaging up everything again and writing the file out to the response stream. Perhaps there are libraries that can work with ZIP streams on the fly without extracting everything.
My Question: Are there easier ways to work with a PPTX file using .NET that don't require working with compressed XML files or buying very expensive software? Basically, we need to modify a PowerPoint file, change some text, and allow the user to download that generated file from a web server.
OpenXML is Microsoft's .NET library that lets you manipulate Office documents. It lets you open a PPTX file and provides an object model that wraps the XML contents.
Here's the link to the OpenXML SDK and the MSDN documentation.
I've used OpenXML to let an ASP.NET page dynamically generate Word documents from a database.
Don't use Office Interop on a web server. It's an all-around bad idea.
If you are only replacing text placeholders for files that will not change, the home-grown solution that finds the placeholders in the XML files in the ZIP archive should be doable. .NET has had zip support for some time, and it is greatly improved if you are able to use .NET 4.5, so you shouldn't need to extract the archive to a temporary location at all.
PowerPoint should also support connecting directly to Oracle in the same way it supports connecting to SQL Server (just play around with the connection options), without needing the special Oracle BI stuff. However, I'd still prefer the home-grown solution, as the direct connection will only work while the PowerPoint file is able to reach your database directly, which is typically only possible in your local LAN environment or with an active VPN.
If you want anything fancier than a simple text replacement, perhaps look for an Aspose competitor.
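As a rough sketch of that home-grown approach, assuming .NET 4.5's System.IO.Compression.ZipArchive is available: the code below edits the slide XML entirely in memory, with no temporary directory. The class name, the method name, and the "{{Name}}"-style placeholder tokens are all hypothetical. It also assumes each placeholder sits inside a single text run; PowerPoint sometimes splits typed text across several runs, in which case a plain string replace will not find it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Security;
using System.Text;

// Hypothetical helper, not part of any library.
public static class PptxPlaceholderSketch
{
    // Takes the template file's bytes and returns a new PPTX with every
    // placeholder key replaced by its (XML-escaped) value.
    public static byte[] ReplaceText(byte[] templateBytes,
                                     IDictionary<string, string> replacements)
    {
        using (var buffer = new MemoryStream())
        {
            // Copy into an expandable stream so the archive can grow.
            buffer.Write(templateBytes, 0, templateBytes.Length);

            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Update, true))
            {
                foreach (ZipArchiveEntry entry in archive.Entries)
                {
                    // Slide content lives in ppt/slides/slideN.xml.
                    bool isSlide =
                        entry.FullName.StartsWith("ppt/slides/slide", StringComparison.OrdinalIgnoreCase)
                        && entry.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase);
                    if (!isSlide)
                        continue;

                    using (Stream stream = entry.Open())
                    {
                        string xml;
                        using (var reader = new StreamReader(stream, Encoding.UTF8, true, 4096, true))
                            xml = reader.ReadToEnd();

                        foreach (KeyValuePair<string, string> pair in replacements)
                        {
                            // Escape &, <, > and quotes so the slide stays valid XML.
                            xml = xml.Replace(pair.Key, SecurityElement.Escape(pair.Value));
                        }

                        // Truncate the entry, then write the modified XML back.
                        stream.SetLength(0);
                        using (var writer = new StreamWriter(stream, new UTF8Encoding(false), 4096, true))
                            writer.Write(xml);
                    }
                }
            } // Disposing the archive flushes the updated ZIP into buffer.

            return buffer.ToArray();
        }
    }
}
```

On the web server, the returned byte array can be written straight to the response with the content type application/vnd.openxmlformats-officedocument.presentationml.presentation.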

Program to put text into LibreOffice document

I'm trying to make a program in C# which should put text into an opened LibreOffice document (Writer).
At first the user can make some decisions about the text (saved to string variables), and when clicking on a button it should put the text from these strings into the document.
How can I do that?
LibreOffice uses the OpenDocument Format (ODF), which is an XML-based format, usually compressed as a ZIP archive, and is easy to work with. AODL is the only open-source library I have found (see the links below), and I'm also sure the .NET libraries can do the heavy lifting for you. Here are some tutorials and links to help you out.
AODL allows your application to support the OpenDocument Format. (Open-source .NET library)
How to Read and Write ODF/ODS File
Read and write ODF/ODS files (OpenDocument Spreadsheets)
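To illustrate the "zipped XML" point without any third-party library, here is a minimal sketch that appends a paragraph to a saved .odt file using ZipArchive and LINQ to XML. It works on a file on disk, not on a document that is currently open in Writer, so the document would have to be closed or reloaded. The class and method names are hypothetical, and it is worth checking that LibreOffice still opens the result, since ODF has packaging rules (such as the position of the mimetype entry) that this sketch relies on ZipArchive leaving untouched.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class OdtAppendSketch
{
    // Standard ODF namespaces for the office: and text: prefixes.
    static readonly XNamespace Office = "urn:oasis:names:tc:opendocument:xmlns:office:1.0";
    static readonly XNamespace Text = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Appends one plain paragraph to the end of the document body.
    // The stream must be readable, writable and seekable.
    public static void AppendParagraph(Stream odtStream, string paragraph)
    {
        using (var archive = new ZipArchive(odtStream, ZipArchiveMode.Update, true))
        {
            ZipArchiveEntry entry = archive.GetEntry("content.xml");
            if (entry == null)
                throw new InvalidDataException("content.xml not found; is this an ODF file?");

            using (Stream stream = entry.Open())
            {
                XDocument document = XDocument.Load(stream);

                // The visible text of a Writer document lives under office:text.
                XElement body = document.Descendants(Office + "text").FirstOrDefault();
                if (body == null)
                    throw new InvalidDataException("office:text element not found.");

                // XElement escapes the string content for us.
                body.Add(new XElement(Text + "p", paragraph));

                // Truncate the entry and write the modified XML back.
                stream.SetLength(0);
                document.Save(stream);
            }
        }
    }
}
```

With a FileStream opened using FileMode.Open and FileAccess.ReadWrite, a button click handler could call AppendParagraph once for each of the user's strings.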

How to merge different document types and show as stack in .NET application

Suppose in .NET (don't care what language) I want to show a user a PDF, Word and Excel file together. I am trying to replicate a document process where a user might have a PDF file and he would like to attach a WORD file and an Excel file let's say to make a stack of documents (that I would save in some directory). Then he would like to click on a button and see a stack of these documents in 1 application of some sort.
How can I display the stack of documents WITHOUT first opening WORD, then opening EXCEL and then opening ADOBE ACROBAT? This would be really annoying for the user. I would like one unified application, or some idea to mimic one in .NET, that can just show all 3 documents as if they were printed one after the other on paper. (I hope I am explaining this clearly)
The only thing I can think of to do this would be to leverage some sort of PDF conversion process to create one PDF file containing all three of these documents in "printed" (page-by-page) form, and then show that. The one application I can think of that could show all of these files is a web browser with appropriate Office and Acrobat viewer plugins, and you might find it difficult to leverage that, as browser preference and other user OS settings can cause various strategies for application launching to fail.
I would convert the documents to PDF and develop a PDF viewer inside your application.
I would use a ready-made library for that; don't reinvent the wheel.
For example: http://www.quickpdflibrary.com/products/quickpdf/index.php

View PDF through C# .Net desktop App

I want to know how I can view a PDF through a C# .NET desktop app. I am trying to create an application to view PDFs using Visual Studio 2008.
There is a PDF library called iText (iTextSharp), but it didn't help me.
You can host an IE WebBrowser control in your application, and that will allow the user to view a PDF if they have a reader installed.
I can provide an example if you tell me whether you are using WPF or WinForms.
Drag a WebBrowser control onto your form.
Set the path to the PDF in code.
Done. Press F5.
I'm not sure what NetBeans has to do with anything, but take a look at this question: How to render pdfs using C#
Essentially you need to get a 3rd party PDF viewer or write one yourself. There are quite a few around, and I would probably take a look at something like PDFViewForNet.
iText isn't a PDF viewer.
If you want to read PDF documents in your application, there are a couple of open-source PDF libraries.
