I'm trying to make a program in C# which should put text into an opened LibreOffice document (Writer).
At first the user can make some decisions about the text (saved to string variables), and when a button is clicked the program should put the text from these strings into the document.
How can I do that?
LibreOffice uses the OpenDocument Format (ODF), which is an XML-based format that is usually compressed with zip and is fairly easy to work with. AODL is the only open-source library I have found (see the links below), and I'm also sure the .NET libraries can do the heavy lifting for you. Here are some tutorials and links to help you out.
AODL allows your application to support the OpenDocument Format (open-source .NET library).
How to Read and Write ODF/ODS File
Read and write ODF/ODS files (OpenDocument Spreadsheets)
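Since the answer above rests on ODF being zipped XML, here is a minimal sketch of that idea in Python, using only the standard library. It works on a saved .odt file, not on a document that is currently open in Writer. It assumes content.xml is UTF-8 and uses the conventional office: and text: prefixes; the function name append_paragraph is made up for illustration. The same steps should map onto a zip API such as System.IO.Compression.ZipArchive in .NET.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def append_paragraph(odt_bytes: bytes, text: str) -> bytes:
    """Return a copy of an .odt package with one extra paragraph at the end.

    Assumption: content.xml uses the conventional 'office:' and 'text:' prefixes.
    """
    closing = b"</office:text>"
    new_para = f"<text:p>{escape(text)}</text:p>".encode("utf-8")
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                if closing not in data:
                    raise ValueError("no </office:text> closing tag found")
                # Splice the new paragraph in front of the last closing tag.
                head, _, tail = data.rpartition(closing)
                data = head + new_para + closing + tail
            # ODF expects the 'mimetype' entry to be stored uncompressed.
            method = (zipfile.ZIP_STORED if info.filename == "mimetype"
                      else zipfile.ZIP_DEFLATED)
            dst.writestr(info.filename, data, compress_type=method)
    return out.getvalue()
```

To use it, read the file's bytes, call append_paragraph with the string chosen by the user, and write the result back to disk while the document is closed in LibreOffice.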
Related
I'd like to implement support for these types of files in my application, but for this I need something that will let me extract raw text from these filetypes.
I'm looking for either a solution that doesn't require any additional libraries, or an all-in-one library/NuGet package. I took a look at GemBox.Document but it doesn't seem to be working with UWP projects.
What would be the best option for this?
I'm looking for either a solution that doesn't require any additional libraries, or an all-in-one library/NuGet package.
There is no such package.
In a standard UWP app we can read an .rtf file with the RichEditBox control; there is a code sample in this document that shows how to edit, load, and save a Rich Text Format (.rtf) file in a RichEditBox.
For .doc and .docx (MS Word) documents, especially versions after 2007, the usual tool is the Open XML SDK, and it currently doesn't support the UWP platform.
For .pdf documents, you can refer to Franklin Chen's thread: [UWP]PDF Viewing on a Windows Universal App.
An epub file is a ZIP archive; to parse it, you can refer to the thread: [WP8.1][C#] How can i read an EPub file in c# on Windows Phone!?.
For mobi files, sorry, I couldn't find any useful development information for the moment. For now I can only suggest converting the file to PDF with a free online service.
In short, since the Open XML SDK currently doesn't support the UWP platform, it is not possible to find a single solution or package for a standard UWP app. You can try to find a web service and call it from your app, or you can use a commercial library that can read documents in all these formats.
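Because an epub file is a ZIP archive of (X)HTML pages, raw text can be pulled out without a dedicated library. The following is a rough Python sketch under a few assumptions: the pages are UTF-8, they end in .xhtml, .html or .htm, and visiting them in file-name order is good enough (the real reading order is defined by the package's OPF spine, which this sketch ignores). The names epub_text and _TextCollector are made up for illustration.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects character data, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def epub_text(epub_bytes: bytes) -> str:
    """Return the text of every HTML page in the archive, one line per page."""
    pages = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        # Assumption: file-name order approximates reading order.
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(archive.read(name).decode("utf-8", errors="replace"))
                parser.close()
                if parser.chunks:
                    pages.append(" ".join(parser.chunks))
    return "\n".join(pages)
```

The same unzip-then-strip-markup approach should carry over to a UWP app with whatever zip and HTML parsing APIs are available there.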
In my application I am using some templates in docx and pdf format. I am storing these documents in the DB as bytes.
Before showing or sending these documents back to the user or application, I need to replace some content inside them. For example, if the document contains ##username##, I need to replace it with the actual username of the customer. I have not found a proper solution for this. Any good ideas?
For the docx file, your best bet is to use OpenXML, and instead of having special text like ##username##, replace it with a content control that you can fill in.
Since you specified docx, you can use OpenXML, which is great, it's an API. If it has to work with older doc files, then you'll have to automate Word (which should be avoided if at all possible).
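For comparison, this is roughly what the plain ##username## swap from the question looks like at the file level. It is not the content-control approach recommended above, just a minimal Python sketch that works on the bytes stored in the database. It assumes word/document.xml is UTF-8 and that each placeholder sits in a single run of text. Word often splits typed text across several runs, in which case the placeholder will not be found, and that is one reason content controls are the more robust choice. The function name fill_placeholders is made up for illustration.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def fill_placeholders(docx_bytes: bytes, values: dict) -> bytes:
    """Replace ##key## markers in the main document part of a .docx package."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "word/document.xml":
                xml = data.decode("utf-8")
                for key, value in values.items():
                    # Escape the value so '&' or '<' cannot break the XML.
                    xml = xml.replace(f"##{key}##", escape(value))
                data = xml.encode("utf-8")
            dst.writestr(info.filename, data)
    return out.getvalue()
```

Headers and footers live in separate parts of the package, so placeholders there would need the same treatment.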
For the PDF, your best bet is to create a PDF form and fill it in at runtime (using a tool like iTextSharp).
HTH,
Brian
For DOC / DOCX:
You should use the MS Word object model through the MS Word assembly reference. This works only on machines that have MS Office installed; otherwise you can use something like the Aspose.Words libraries, which won't need an MS Office installation on the server. You can programmatically trigger Word's Find and Replace through the library's API.
For PDF: you will need a third-party library for editing PDF files. Libraries like ABCpdf are available (I'm not sure whether Adobe itself has something for this).
The mechanism is the same as for the Word library, but I am not sure whether you will be able to trigger Find and Replace here or will have to do something else. I have not used a PDF generation library.
I need to make an upload tool in which a Word document is converted to HTML format for saving to a database. Any ideas?
I've written one (see the Doc to HTML Converter).
To implement it, I downloaded the PIAs for Word, which let me open a document using Word, and control the format in which Word then re-saves the document.
Alternatively (instead of doing it yourself) there are tools like mine (and others, more famous) which you can use (some of which don't even use Word).
I know this is an old post, but I just wrote an app that converts a Word-doc to a usable web-page. The app provides some of the requirements in the OP.
The app is WordWebNav (WWN). It's free and open-source.
WWN provides a Word VBA program that converts Word-docs to Word-HTML.
WWN also provides a Python program that converts the Word-HTML to a usable web-page:
It adds missing features to the Word-HTML, e.g., a navigation pane.
And, WWN fixes some common bugs in Word's HTML, e.g., mis-formatted lists, and overly-wide paragraphs.
The Python program uses a CLI, and it can be called externally.
If this is a client application and you have access to Word, why not automate Word? Word can save in HTML (although you will probably have to clean the HTML up a bit). However, I will warn you that this is not very portable; whoever is going to use the application will need to have the same version of Word you developed it with.
I need to populate a Word 2007 document from code, including repeating table sections. Currently I use an XML transform on the document.xml portion of the docx, but this is extremely time-consuming to set up (each time you edit the template document, you have to recreate the transform.xsl file, which can take up to a day for complex documents).
Is there any better way, preferably one that doesn't require you to run Word 2007 during the process?
Regards
Richard
I tried myself to write some code for that purpose, but gave up. Now I use a 3rd party product: Aspose Words and am quite happy with that component.
It doesn't need Microsoft Word on the machine.
"Aspose.Words enables .NET and Java applications to read, modify and write Word® documents without utilizing Microsoft Word®."
"Aspose.Words supports a wide array of features including document creation, content and formatting manipulation, powerful mail merge abilities, comprehensive support of DOC, OOXML, RTF, WordprocessingML, HTML, OpenDocument and PDF formats. Aspose.Words is truly the most affordable, fastest and feature rich Word component on the market."
DISCLAIMER: I am not affiliated with that company.
Since a DOCX file is simply a ZIP file containing a folder structure with images and XML files, you should be able to manipulate those XML files using your favorite XML manipulation API. The specification of the format is known as WordprocessingML, part of the Office Open XML standard.
I thought I'd mention it in case the 3rd party tool suggested by splattne is not an option.
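As a rough illustration of manipulating that XML directly for the repeating-table case, here is a Python sketch that takes the text of word/document.xml (read out of the .docx ZIP package) and repeats one template table row per data record. It works at the string level so that the rest of the markup is left untouched. The assumptions are mine: the template row is the first row containing a ##field## style marker, each marker sits in a single run, the w: prefix is used, and there are no nested tables. The names expand_rows and ROW_RE are made up for illustration.

```python
import re
from xml.sax.saxutils import escape

# Matches one table row; '<w:tr' must be followed by a space or '>' so that
# elements such as <w:trPr> are not mistaken for a row.
ROW_RE = re.compile(r"<w:tr[ >].*?</w:tr>", re.DOTALL)


def expand_rows(document_xml: str, records: list) -> str:
    """Repeat the first row that contains ##field## markers, once per record."""

    def render(row: str, record: dict) -> str:
        for key, value in record.items():
            row = row.replace(f"##{key}##", escape(str(value)))
        return row

    for match in ROW_RE.finditer(document_xml):
        row = match.group(0)
        if "##" in row:
            rows = "".join(render(row, record) for record in records)
            return document_xml[:match.start()] + rows + document_xml[match.end():]
    return document_xml  # no template row found; leave the document as it is
```

Because the template row stays in the Word document itself, editing the template's layout should not require rewriting any transform, only keeping the markers in place.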
Have you considered using the Open XML SDK from Microsoft? The only dependency is on .NET 3.5.
Documentation: http://msdn.microsoft.com/en-us/library/bb448854%28office.14%29.aspx
Download: http://www.microsoft.com/downloads/details.aspx?familyid=C6E744E5-36E9-45F5-8D8C-331DF206E0D0&displaylang=en
Use the Invoke Docx library. It supports table data (http://invoke.co.nz/products/help/docx_tables.aspx). More info at http://invoke.co.nz/products/docx.aspx
Have you considered using VB? You could create a separate assembly to populate your document.
I know you are looking for a C# solution, but this is one area where VB's XML literal support could help you populate the document. Create a document in Word to serve as a template, unzip the docx, paste the relevant XML section you want to change into your VB code, and add code to fill in the parts you wish to change. It's difficult to say from your description whether this would meet your requirements, but I would suggest looking into it.
I'd like to be able to read the content of Office documents (for a custom crawler).
The Office versions that need to be readable are 2000 to 2007. I mainly want to crawl Word, Excel and PowerPoint documents.
I don't want to retrieve the formatting, only the text.
The crawler is based on Lucene.NET, if that is of any help, and is written in C#.
I already use iTextSharp for parsing PDF.
If you're already using Lucene.NET you might just want to take advantage of the various IFilters already available for doing this. Take a look at the open source SeekAFile project. It will show you how to use an IFilter to open and extract this information from any filetype where an IFilter is available. There are IFilters for Word, Excel, PowerPoint, PDF, and most of the other common document types.
There is an excellent open source project, POI. The only drawback is that it is written for Java.
The .NET port is still very much in beta.
Here is a good list of various tools for converting Word documents to plaintext, which you can then do whatever with.
Here's a nice little post on c-sharpcorner by Krishnan LN that gives basic code to grab the text from a Word document using the Word Primary Interop Assemblies.
Basically, you use "WholeStory" to select the entire Word document, copy it to the clipboard, then pull it from the clipboard while converting it to text format. The clipboard step is presumably done to strip out formatting.
For PowerPoint, you do a similar thing, but you need to loop through the slides, then for each slide loop through the shapes, and grab the "TextFrame.TextRange.Text" property in each shape.
For Excel, since Excel can be an OleDb data source, it's easiest to use ADO.NET. Here's a good post by Laurent Bugnion that walks through this technique.
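The slide-then-shape loop described above for PowerPoint has a file-level counterpart for the 2007 .pptx format, which needs no Office installation. This is a different technique from the interop one, shown as a minimal Python sketch: each slide is an XML part under ppt/slides/ in the ZIP package, and the visible text sits in DrawingML a:t elements. It does not cover the older binary .ppt files, and it orders slides by file number, which is an assumption of mine (the true order is recorded in the presentation part). The name pptx_text is made up for illustration.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_text(pptx_bytes: bytes) -> list:
    """Return one list of paragraph strings per slide, ordered by file number."""
    slides = []
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as archive:
        names = [n for n in archive.namelist() if SLIDE_RE.match(n)]
        # Sort numerically so that slide10.xml comes after slide2.xml.
        names.sort(key=lambda n: int(SLIDE_RE.match(n).group(1)))
        for name in names:
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for para in root.iter(A_NS + "p"):
                # A paragraph's text may be split over several runs.
                text = "".join(t.text or "" for t in para.iter(A_NS + "t"))
                if text:
                    paragraphs.append(text)
            slides.append(paragraphs)
    return slides
```

The resulting strings can be fed to the Lucene.NET index in the same way as the text obtained through interop.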
You might also consider checking out DtSearch (www.DtSearch.com). Although it is primarily a searching tool, it does a great job of extracting text from a large number of file types and is considerably cheaper than other options like the Oracle/Stellent OutsideIn technology or the equivalent from Autonomy.
I've been using DtSearch for years and find it indispensable for this type of task.