Batch conversion of docx to clean HTML - c#

I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.
I think it'd benefit to explain what that entails. I work for database group at my university's IT department. My main job is to take specs of a report in a docx file, copy that over to dreamweaver, fix some formatting, and put it onto their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header, and footer from the webpage on there, and save the result. I originally planned to have it do one by one, but it probably wouldn't be difficult to have it input a list of files and batch convert.
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started using. XSLT had the benefit of not needing word to be installed nor ran for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs the document.xml from that, and uses the DocX2Html.xsl file I scavenged from OpenXML viewer. I believe that was originally provided by MS for sharepoint servers to provide the ability to render word documents in a browser. Or something along those lines.
After adjusting that code to fit my needs, and having issues with the objXSLT.Load () method, I ended up using IlMerge to make the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the result file has ridiculously ugly HTML syntax. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue with that is that I don't know anything about how to do that. Perhaps there's an alternative version already out there. All I'd need is one that will preserve tables and text formatting. Images aren't needed.

This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml

Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:
Open the Word document with Aspose.Words.
Save the Word document as HTML.
Use something like SgmlReader or HTML Agility Pack (or even Regular Expressions if it is suitable) to remove unwanted HTML tags/attributes.
Since you wrote you work at an university, I'm not sure whether commercial packages are an option, though.

Hi not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one click conversion eg you can right click on a word file and it will be directly converted to html and the code placed into the clipboard. The current version also supports command line access and the new version will have a server version to.
There is a free trial version downloadable from the site , and if you have any questions do contact me any time.

Related

Is it at all possible to convert a document to PDF or edit a PDF in C# using only free software?

I had this stupid idea of creating a template as a .docx or .rtf or .pdf and then replacing the text in that document to generate reports. This seemed like a better way of doing it than using paid reporting software.
Well, I believe I've tried just about everything now and I'm amazed at how impossible it is to do anything with pdfs.
Try 1
HTML -> PDF
A lot harder to design the template. It doesn't look the same when you print it. Never got it working outside of a command line example (not sure how well, say, iTextSharp-LGPL would even work or if it could handle base64 strings as I'm not sure how else you are going to tell it about images). In any case, doing it this way makes it too hard to design the template.
Try 2
OpenXml -> PDF
I stupidly assumed that because Word could save as PDF that OpenXml could to. I was wrong. It cannot save as a PDF.
Try 3
OpenOffice/LibreOffice (docX -> PDF)
It can't read OpenXml which is a problem because I was editing the template as OpenXml and then saving that result (as a .docx) but it can't read that saved document.
Try 4
iTextSharp LGPL
This one just doesn't work, lol. And apparently even though when you google "convert rtf to pdf" the ONLY thing that comes up is iText and its derivatives it doesn't convert rtf documents to pdf documents. I verified this myself (it only saves the text not the formatting) and later found this post to convince me I wasn't doing something wrong.
Try 5
PDF -> PDF
Since converting ANYTHING to a PDF seems to be impossible maybe I can save the template as a PDF and just do a text replace on that. Nope, lol, that is apparently a very difficult thing to do.
Try 6
Pandoc (.odt/.docx -> pdf), (.rtf -> .pdf not supported)
pandoc mockup2.odt -s -o mockup2.pdf
link to the files in the picture. *note, it messes up in the same way if you try converting .odt/.docx to .tex.
What do I do here? Buy software so that I can save a file as PDF? Is that the only option?
I have a solution. I'm not saying it's the best solution. LibreOffice (or possibly OpenOffice if you are so inclined) accepts command line arguments that will do the switch.
soffice.exe --headless --convert-to pdf mockup.odt
*note - this is after I added libreoffice to my path (C:\Program Files\LibreOffice\program). idk why it's called soffice.exe instead of libreoffice.exe.
where i found the answer
relevant documentation
I might have a working solution for you, if you are stuck with the docx-file for the template.
I found one free solution for docx to pdf conversions, without using microsoft.interop, etc.: See first answer in this stack overflow post
It uses two tools: The open xml power tools and DinkToPdf (Which is essentially a wkhtmltopdf wrapper). The html to pdf part works just fine, but the docx to html part looks like a catastrophe at first. You can fix this with custom css (There are some resources online).
Powertools-.NetStandard
DinkToPdf-GitHub
There are more possibilities for proprietary software, like Asposes.Words and Syncfusion file-formats. Most of the proprietary solutions are pretty expensive...
If you are just working on a Windows Environment, where MS-Office is installed, you can use Microsoft.Interop. It is by far the easiest solution (In this post, Interop is mentioned several times Stackoverflow Word to PDF
If you found another (better) working solution, please let me know. I still have not decided if I will use a proprietary or a free solution. :-)

Server Side HTML to PDF

I'm trying to find a C# library that will allow me to "Print" one of my HTML pages to a PDF file. I can't seem to find out if one currently exists that will allow you to do this. I've found several that will let you build a page, but haven't noticed if one would generate the pdf only based off of HTML.
EDIT: I'm not allowed a budget on this at work so it will need to be an open source/free product. If not I'm aware of iTextSharp and will have to generate the pdf programmatically (which is what I'm hoping to avoid :) )
I've had a lot of luck with ActivePDF WebGrabber. It's kind of odd to use compared to standard managed libraries (ActivePDF is unmanaged), but it gets the job done.
iTextSharp comes with a little companion : XML Worker
For a demo, have a look here
Even though the documentation refers to the Java API, the adaptation to C# should be straightforward.
I've experimented with itextsharp and it works for basic conversion, but gets complicated when you get into styles and formatting. I've also heard wkhtmltopdf is out there as another option.

word document, how to edit the right way in asp.net application

I have asp.net app in which I need to edit a word document and than send that document in email as an attachment.
I would like to know what will be the best way to edit the word document and than use it.
The document already has data and there are few variables such as "company name", "date", "amount", etc that I am searching in the document and I am replacing them with values from within the code.
The code works great when I am running it locally but from some people I received answers that editing word document on the server shouldn't be the way I am doing now but I need to use either openxml to edit the document or google docs.
Any idea what's the best way to tackle this?
I would vote for OpenXML, but be prepared to spend a good day or two reading how to use the API for .NET and be patient. =)
I remember using this tool -
http://openxmldeveloper.org/resources/dotnet/m/cc/303.aspx - quite a bit to find the relevant parts in the document to modify. You basically load a Word document and can "drilldown" to find the parts you want to modify. You can actually write some pretty clean code to search the document for your textual markers and then replace them with data.
(I hope I understood the question correctly. You said you already had working code, so I wasn't sure what the question was.)
You can use the Open XML Format SDK as per http://msdn.microsoft.com/en-us/library/dd440953%28v=office.12%29.aspx
For what you're doing though, I think your approach is fine.
I have a fair amount of screwdrivers, but if I noticed a lose screw in the stool in front of me, I might just use the knife on the table because it will do the job perfectly adequately and save me a trip to the toolbox. It's not a tool designed for that job, but it is a tool that would do the job just as well and with less effort.
Now, if I decided to set about a day's worth of DIY with only the knife instead of the set of screw-drivers, that would be going to the other extreme. Here I'd have long-ago crossed the line where using the tools designed for a given job would have made my life much easier.
It's just the same with software tools.
One of the very points of XML formats is that we can do simple tasks with it treating it just as text. Yeah, we none of us want to be the guy with a 3-page-long regular expression with which they're trying to parse a complicated XML document, but when the problem naturally breaks down to a simple text substitution, do a simple text substitution.

How to generate an index at end of a PDF file?

Given an existing PDF document, I would like to tack on an index to the end of the file to show the pages on which key words show up. It would be best if I don't have to give a list of words to look for and the list of words is automatically generated. However, if a list of words must be given, I can work with it. I'm looking to do this either through a C# library or a command line tool. It needs to run as part of another command line app.
Is there anything out there that is capable of this?
This "PDF Index Everthing" (http://www.pdfstore.com/details.asp?ProdID=799) seems to be on the right track, but requires interaction through its GUI.
I don't actually have an c# solution but hopefully this will still help...
pdflib is an excellent pdf development library. It is one of the better libs available. As far as I know it doesn't have a C# binding. PDF is a random access object-based file format and although there are many libraries that allow for creating of pdfs, most freely available libs don't support adding pages to existing pdfs. pdflib does support adding pages with it's pdi option, so it may be worth checking out.
Updated Info:
Check out- iText# library and
merging pdf files with C# and iText

Using Word 2007 as CMS page editor

I have been searching for several hours but i couldn't find anything about this... Basically I would like to create a template or plug-in for word 2007 that would allow someone to create new pages for a CMS. What I have in mind is something similar to blog post template. I know how to create a basic template but I can't find a way to publish the created document using a publish button inside the Word.
thnx in advance
I understand what you are trying to achieve, but Word is the wrong starting point. I would start with a much more basic text editor.
Word is horrible, horrible, horrible. Your site will define clear styles, yet Word will output nasty HTML that won't match your website's CSS definitions.
Your best bet therefore is to have a means to drop the Word file into the site, and have code programmatically analyse it and transform it into site-valid HTML. In Java you could use Apache POI, but that's very raw still. Might be a lot easier in a Microsoft centric world.
Far better, in my opinion, is to force people to learn Markdown, or BBCode, or HTML, or to use a Styled HTML Editor in your CMS - cut and paste plain text in, then style with the CMS defined styles.
As you are using Word 2007 you can export the document as XML and then use XSLT to generate the HTML.
If your CMS has an API or import facility you could convert the output from Word to suit that interface.
You can write a Word macro to add a Publish button/menu option to Word that will generate the correct output.
It's not a bad idea since it's all about the end user. If Word produces bad HTML, you should just make it semantic correct before posting it to the CMS.
I've never done this but I'm sure that it's possible to with .NET via the "Word 2007 Addin"-template (assuming Office 2007).
Good luck!
You can do what you want if you use SharePoint 2007 as your CMS. You can set up a blog on SharePoint 2007 and post to the blog from Word. If you use Office 2007 on the client end then you will get some nice buttons like "post to my blog" etc.
If you can't use SharePoint or are talking about an existing CMS, you have a lot of hurdles to jump through. This is a major undertaking and not something you can get a simple answer out of Stack Overflow.
Have you considered using one of the freely available Javascript WYSIWYG Editors such as TinyMCE http://tinymce.moxiecode.com/? When configured with all the options, it has an impressive amount of functionality and the interface is very similar to Word. I realize this doesn't directly answer your question, but as others have pointed out starting from Word is going to be difficult.
I've been on a team that wrote a Word addin for a custom CMS system. It was written in VB6 and was able to take a Word document and turn basic formatting information - lists, bold, italic and even tables into HTML, which was uploaded to the server. It didn't create new pages or manage the site in the addin though.
I would definitely avoid choosing Word as the editor for your CMS from my experience. The biggest issue is each time you want to update the addin you have to redistribute it to the company or companies using it. You can do this is as an IE active-x control but it's far easier just to handicap the user to a limited set of styling options via a Javascript editor.
Word does have a powerful API for manipulating your content with, however we needed to disable so many options in Word to avoid unwanted fonts and so on, it resembled Wordpad more than Word in the end.
If it's a greenfield project and you have the time, I would infact recommend using Silverlight 4.0 over a Javascript editor. Version 4.0 has a richtextbox control built in, plus there is also the excellent Vectorlight one.
May be it helps you, umbraco CMS allow editing with Microsoft Word.
For some reason this is a feature that Excel enjoys but not Word.
Excel can can automatically publish an HTML file version of your document when you save it.
Unfortunately Word seems to only be able to achieve this functionality when using Sharepoint, which is a shame because it can be quite useful.
What you can do, short of creating your own add-in is to add a bit of code to your template to create a HTML copy of your document whenever the user saves it.
First, make sure your template is macro-enabled (saved as .dotm file).
Second, while editing the template in Word, open the VBA code editor (ALT-F11)
In the project list double-click on your document to open its code-behind file.
Add the following bit of code to it, modifying the ActiveDocument.SaveAs path to something more appropriate to you, like a shared network folder where your CMS exposed by your CMS.
Sub FileSave()
' First Save the main document
ActiveDocument.Save
' Now we create a new document based on the current one
Selection.WholeStory
Selection.Copy
Documents.Add
Selection.PasteAndFormat wdPasteDefault
' Save it as HTML and close it
ActiveDocument.SaveAs "c:\temp\mydoc.html", fileformat:=wdFormatHTML
ActiveDocument.Close
End Sub
This will copy the original file into a blank new one that will be saved to HTML and closed before returning to the original file.
You can check some of the options to the Documents.Add if you want to use a different template than the normal one.
Security
because this template contains macros, you will have to install it with the other templates where Word expect them.
If you don't, then you'll get a security warning.
To avoid getting it, you can add the path where your templates are located to the list of Trusted Locations under Word's Options > Trust Center > Trust Center Settings > Trusted Locations.

Categories

Resources