Convert DOC to PDF Without Interop or Aspose in C# code

Convert DOC to PDF Without Interop or Aspose in C# code - c#

I need to convert a document .doc to .pdf without using Microsoft Interop or Aspose cause this code using a licence(Aspose) or need to have Microsoft Office in server(Interop).
How can I do it? Is that possible?

From your question I guess you want the conversion to be done on the server side (you mentioned Aspose.net).
I searched for tens if not hundreds of libraries, and could not find something, free, for commercial use, that has similar or better quality results then the pricey Aspose.
That said, you can get decent results for simple documents, using OpenOffice or better yet, LibreOffice.
Once installed, you can invoke a command line exe with parameters to do the conversion.
soffice --headless --convert-to pdf filename.doc
For more information you can either google it, or, if all else fails, read the documentation

Related

Convert docx to postscript

I need to convert a Word document (docx) to a postscript file so that I can use this postscript file to generate PDF using the Ghostscript command line tool.
How do I generate the postscript file from the docx?
I need to code using .NET/C#. I found about LaTeX which generates postscript but how do I make my Word file be used with LaTeX or any other tool to get the postscript generated?

There are three main products I will mention that understand DOCX.
The obvious one is MS Word. It produces the definitive rendering of all DOCX files. Nothing is ever going to be exactly the same. By definition it is always correct. However it is not really designed for automated conversion and getting it to do this kind of thing is fraught with difficulty. On a legal level the EULA may confict with your chosen solution.
OpenOffice.org is a great product. The EULA is much more accomodating. The freeness is attractive. However, while it will produce a pretty good output for most DOCX documents it does not for all. While it is similar to MS Word it is not the same and this is something you may notice, particularly for more complex documents. Probably more importantly, again it's not designed for automated conversions and trying to get it to do this can be fraught and tiresome.
WordGlue .NET (on which I work) is a native .NET library that understands DOCX. It is designed specifically to produce output which is the same as MS Word. While I'm not going to say it is perfect (it's a big task) it is superior to OpenOffice.org in that it does actually attempt this as a specfic design decision. However probably the biggest advantage is that it is designed for high perfomance multi-threaded server side conversion. It's native .NET and thus low impact in terms of security.
Products like ABCpdf (on which I work) will integrate with these three applicatons to allow conversion direct to PDF. Why bother going via PostScript if you want PDF? However if you really want to save as PostScript you can do that too.
Or indeed you can write your own code to integrate with these products. Just be aware of the caveats above regarding fraughtness and tiresomeness relating to MS Office and OpenOffice.org. To get these things working unattended requires an awful lot of attention.

You need to print it to a PostScript file, from an application which can read .docx files. Or you could just export direct to PDf from the app, as far as I know anything which reads .docx and can print, can also write a PDF file.

If you have a windows computer you can use the commandline
"%ProgramFiles%\Windows NT\Accessories\wordpad.exe" /pt foobaar.docx "printerThatDumpsPS"
You can find file printers for postscript printing for free on the internet. Or if you have adobe pfdf, pdf exchange or any PS printer. You can use c# to temporarily set the printers settings so that it does this for you.
So for example using pdf exchange as follows,
"%ProgramFiles%\Windows NT\Accessories\wordpad.exe" /pt foobaar.docx "PDF-XChange Printer 2012"
Produces a pdf file without much of a trace anywhere what program was used, assuming pdf exchange was set to save file without asking.
This produces a passable document but yeah it looses quiet many features. But it might be enough.

Need to find DOCX to PDF Conversion C# API

From within C#, I want to be able to take a DOCX file and convert it to PDF.
How can I do this?
The catch is that I would like to do other types too, e.g. images, doc files, etc.
I also ideally would like there to be no office installed on the computer where this software will be running.
Perhaps the answer is to some software that 'prints to pdf'
My software is dealing with arrays of data representing the file, so it would ideally be some kind of API that handles byte arrays.

There aren't a ton of good C# libraries for this one. It's hard to do without COM.
Here's one option:
http://www.aspose.com/categories/.net-components/aspose.pdf-for-.net/default.aspx

If you want something free (but requires Microsoft Word to be installed), you could try using Word itself via .NET code:
http://www.codeproject.com/KB/cs/CreatePDFsForFree.aspx
It isn't the solution for everything but it can be useful at times.

DOCX is Office 2007 format. If you don't mind using the built-in functionality of Office 2007, you might want to check this link out:
http://msdn.microsoft.com/en-us/library/bb412305.aspx

Office automation + Save As Pdf Add-in ?

Reading a Excel 2007 Spreadsheet in C#

I've done some googling and there seems to be a plethora of tools for reading excel 2007 spreadsheets using c#. I'd like to know which one performs best and is easy to use.

Frankly, it depends on if you have Excel installed on the machine.
If you have Excel installed, using Microsofts Office APIs is as quick and ( for the price ) painless as you are going to get. That said, if you don't have Excel installed, the question gets much trickier.
Also, note, if you are installing to a web server that is available to the world, using Microsoft's Office APIs often isn't actually legal, as the end user needs to have an office license for you to be legal. If this is the case, and you are developing for redistribution or internet deployment, make sure that the library you are considering has no dependencies on the Office APIs, as their redistribution isn't actually legal either. If you need to, for example, provide a viewer for people that don't have Excel installed, you can't legally use the Office APIs, nor can you use 3rd Parties that depend on that layer. Compliant libraries will make it clear in their description that they don't depend on Excel.

It depends what you're trying to do with the spreadsheet.
If you want to read tabular data, OleDB is the best option.
If you want to read formatting or specific cells, you should use a third-party component. (I can't recommend a specific vendor)
You should avoid using Excel's COM object model unless you want to interact with an open window or print a spreadsheet.

I've had pretty good results with Koogra.

Check the Open XML SDK. It includes a "Productivity Tool" that lets you walk through the structure of a document to help write your code as well.

I looked into the Open XML Format SDK about a year ago and it seemed like a good alternative that does not require Office installed.

I have used GemBox and the Microsoft Office APIs to read in spreadsheets. I have had success with both options but I tend to like GemBox a bit more. It is fast and simple to implement but comes with a price tag.

Generating a Word Document with C#

Given a list of mailing addresses, I need to open an existing Word document, which is formatted for printing labels, and then insert each address into a different cell of the table. The current solution opens the Word application and moves the cursor to insert the text. However, after reading about the security issues and problems associated with opening the newer versions of Word from a web application, I have decided that I need to use another method.
I have looked into using Office Open XML, but I have not found any good resources that provide concrete information on exactly how to use it. Also, someone suggested that I use SQL reporting services, but searching for information on how to use them, lead me nowhere.
Which method do you think is the most appropriate for my problem?
Code samples and links to good tutorials would be extremely helpful.

Thanks for all the answers, but I really did not want to pay for a plugin and using Word automation was out of the question. So I kept searching and eventually, through some trial and error, found some answers.
After throughly searching through Microsoft's site, I found some newer articles on the Office Open XML SDK. I downloaded the new tools and just started going through each them.
I then found the Document Reflector, which creates a class to generate XML code based off an existing Word Document (.docx). Using my Label Template Document and the code this tool generated, I went through and added a loop that appends table cells for each address. It actually proved to be fairly simple and way faster than using Word automation.
So, if you're still using Word automation check out the Office Open XML tools. Their surprisingly extensive for a free download from Microsoft.
Office Open XML SDK 2.0 Download

I use the Words plugin from Aspose.com to do mail merges (programming guide).

You can take a look show 137 and 138 on dnrTV (www.dnrtv.com). In these video's Beth Massi shows how to do some editing and mail merging with OpenXML. She does this by using the Open XML SDK and xml literals in VB. It requires no third party components. Also it doesn't require MS Office to be installed on the machine.
This video inspired me as a C# developed (and no VB experience) to do some XML manipulation in a separate dll in VB. I call into this dll from my C# application.
It is worth a try.

We have the product Aspose that tvanfosson has mentioned. The edition that we purchased works with SQL Reporting Services so it can be used with the scheduler for creating output. It is really a great product and we used in a system that needed to support Korean characters in the final document. It works great and was under $1K with support. Not bad.
The advantage of using a product like this is that you can continue to manage your data and the skill set required to produce the documents is at a level where a variety of developers can support its use.

Vanstee,
If you really want to do this in code, check out this post I just found on Google
http://kellychronicles.spaces.live.com/blog/cns!A0D71E1614E8DBF8!1364.entry

If you are using reporting services cant you just move the information in the word doc into a database table and read it from there, taking word out of the equation?

Parsing Office Documents

I`d like to be able to read the content of office documents (for a custom crawler).
The office version that need to be readable are from 2000 to 2007. I mainly want to be crawling words, excel and powerpoint documents.
I don`t want to retrieve the formatting, only the text in it.
The crawler is based on lucene.NET if that can be of some help and is in c#.
I already used iTextSharp for parsing PDF

If you're already using Lucene.NET you might just want to take advantage of the various IFilters already available for doing this. Take a look at the open source SeekAFile project. It will show you how to use an IFilter to open and extract this information from any filetype where an IFilter is available. There are IFilters for Word, Excel, Powerpoint, PDf, and most of the other common document types.

There is an excelent open source project POI, only drawback - it is written for Java.
The .net port is somehow very beta.

Here is a good list of various tools for converting Word documents to plaintext, which you can then do whatever with.

Here's a nice little post on c-charpcorner by Krishnan LN that gives basic code to grab the text from a Word document using the Word Primary Interop assemblies.
Basically, you get the "WholeStory" property out of the Word document, paste it to the clipboard, then pull it from the clipboard while converting it to text format. The clipboard step is presumably done to strip out formatting.
For PowerPoint, you do a similar thing, but you need to loop through the slides, then for each slide loop through the shapes, and grab the "TextFrame.TextRange.Text" property in each shape.
For Excel, since Excel can be an OleDb data source, it's easiest to use ADO.NET. Here's a good post by Laurent Bugnion that walks through this technique.

You might also consider checking out DtSearch (www.DtSearch.com). Although it is primarily a searching tool, it does a great job of extracting text from a large number of file types and is considerably cheaper than other options like the Oracle/Stellent OutsideIn technology or the equivalent from Autonomy.
I've been using DtSearch for years and find it indispensible for this type of task.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.