Creating a Word document from HTML

I know there are a lot of other questions on SO about this topic, but I need some more information. This is a two-part question about my requirement: dynamically generate an MS Word document from HTML and prompt the user to download it.
Q1) From what I'm reading it seems that Microsoft.Office.Interop is not designed to be used for server automation since this is just a wrapper around the application and would require Office to be installed on the web server. Is this correct?
I have gotten some of this to work: I get prompted to download and the Word doc saves properly, but the doc shows my markup as its content instead of the rendered HTML. From what I've read, it's supposedly possible to export HTML to MS Word like this without any 3rd party tools or components. I'd also like to avoid the Open XML format, as I can't guarantee which version of Word my users have.
Q2) What am I missing here to get my HTML to appear rendered in the MS Word output file? doc.DocumentBody is a string type that contains the entire HTML document.
    public FileStreamResult DownloadDocument(string id)
    {
        /* pseudo-code here to fetch my custom "Document" object from DB */
        Document doc = DocumentService.FindById(id);
        var fileName = string.Format("{0}.doc", doc.Title);
        Response.AddHeader("Content-Disposition", "inline;filename=" + fileName);
        return new FileStreamResult(WordStream(doc.DocumentBody), "application/msword");
    }

    private static Stream WordStream(string body)
    {
        var ms = new MemoryStream();
        byte[] byteInfo = Encoding.ASCII.GetBytes(body);
        ms.Write(byteInfo, 0, byteInfo.Length);
        ms.Position = 0;
        return ms;
    }

I have used essentially the same code as you to download HTML as Word documents, and it works fine. I modified my code to match yours as a test, and it still worked, so I wonder if the issue is actually with your HTML.
Have a look at doc.DocumentBody in your debugger and check that it is valid HTML.
Is it wrapped in <html><body></body></html>?
I ran a test, and I think that if you leave out the body tags, you'll end up seeing raw HTML.
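The check above can be automated before the stream is returned. What follows is only a minimal sketch: the WordHtml class and both method names are hypothetical, and it assumes that Word picks up the encoding from a UTF-8 byte order mark plus a charset meta tag. It also swaps Encoding.ASCII for UTF-8, because ASCII encoding replaces every non-ASCII character with '?'.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the question's code base.
public static class WordHtml
{
    // Wraps a fragment in <html><body> unless the input already has an <html> element.
    public static string EnsureHtmlDocument(string body)
    {
        if (body.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0)
            return body;
        return "<html><head><meta charset=\"utf-8\"></head><body>" + body + "</body></html>";
    }

    // Encodes the document as UTF-8 with a byte order mark and rewinds the stream.
    public static MemoryStream ToWordStream(string body)
    {
        var encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
        byte[] preamble = encoding.GetPreamble();
        byte[] content = encoding.GetBytes(EnsureHtmlDocument(body));

        var ms = new MemoryStream();
        ms.Write(preamble, 0, preamble.Length);
        ms.Write(content, 0, content.Length);
        ms.Position = 0;
        return ms;
    }
}
```

In the question's controller, WordHtml.ToWordStream(doc.DocumentBody) would take the place of WordStream(doc.DocumentBody).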

Yes, and running Office applications on a server without a UI is not supported. (Note: "not supported" does not mean it will not work, only that no guarantees of any kind are made.)
Use the File method to return the file (http://msdn.microsoft.com/en-us/library/dd505200.aspx), and check out this popular answer: How can I present a file for download from an MVC controller?

Microsoft.Office.Interop is not designed to be used for server automation since this is just a wrapper around the application and would require Office to be installed on the web server. Is this correct?
Yes.
What am I missing here to get my HTML to appear rendered in the MS Word output file?
Well, you need to create a Word document, of course! Word's file format and the HTML file format are different.
There are some very good commercial libraries out there that provide a nice API for generating Office documents programmatically. With Office XML, this is not quite as necessary - it's now much more feasible to generate the XML that Word knows how to read.

Related

Convert Word doc to docx format in .NET Core using b2xtranslator library

I need to convert .doc and .docx documents to PDF on the server side using .NET Core. I searched for it and found this question, which has a remarkable answer for the .docx to PDF issue. It says that you first have to convert the document to HTML using OpenXMLPowerTools, and then convert the HTML to PDF. As you can see in that answer, there is also a solution for converting .doc to .docx using b2xtranslator, a library that converts Microsoft Office binary files to Open XML format files. What I am missing is how to use this library. I can't find any sample showing how to convert a .doc file with it, only this comment on this question.
Based on that, I tried to use the library, but I hit a dead end. This is my code:
    //check file extension
    FileInfo file = new FileInfo(textBox1.Text);
    if (file.Extension == ".doc")
    {
        FileStream streamDocFile = new FileStream(file.FullName, FileMode.Open);
        var fileDoc = new b2xtranslator.DocFileFormat.WordDocument(new b2xtranslator.StructuredStorage.Reader.StructuredStorageReader(streamDocFile));
        var fileDocx = b2xtranslator.OpenXmlLib.WordprocessingML.WordprocessingDocument.Create(file.Name + "x", b2xtranslator.OpenXmlLib.OpenXmlPackage.DocumentType.Document);
        b2xtranslator.WordprocessingMLMapping.Converter.Convert(fileDoc, fileDocx);
    }
My questions are:
How do I write the .docx file? I don't know whether the code is right, because I am confused about how to write the fileDocx object to a file and check it.
How do I pass the .docx produced by b2xtranslator to Open-XML-PowerTools so I can convert it to HTML?
Thank you in advance.

In the end, we decided to use a 3rd party library for document processing. Because we need a stable library and have little time to finish the project, our company decided to buy one.
This answer is not very helpful for those looking for a free way to process .doc files. Perhaps you will find a better one.
Thank you

TL;DR: see the OP's answer; this route was simply too problematic, and paid is usually cheaper (lower total cost of ownership, faster to market, and someone else to provide support with the tricky bits).
Practically every conversion from MS word processor formats (RTF, WPS, DOC, DOCX) to PDF should be done directly, for example with the Adobe Word plug-in or MS export/save as PDF.
If you need to use B2X .NET interop(erability), see code using Microsoft.Office.Interop.Word; there are still legacy dependencies to consider (https://learn.microsoft.com/en-us/archive/blogs/interoperability/binary-to-open-xml-b2x-translator-interoperability-for-the-office-binary-file-formats), potentially an older full suite of .NET and an older MS Office (my attempts to install an older .NET failed on Win 11).
The second-best alternative is to use a current Open Office, which should convert DOC directly to PDF on export. Here is the B2X demo.doc opened by default in LO Writer 7.4 and exported as PDF with the defaults.
LibreOffice has good command-line conversion and "Basic App"/DDE support, so you can control any adjustments needed. For the current command-line input and output filters, and thus the different MS .doc versions supported, see https://help.libreoffice.org/7.4/en-US/text/shared/guide/convertfilters.html?&DbPAR=SHARED&System=WIN
Filter name | API name | Media type (extensions)
Microsoft WinWord 1/2/5 | "MS WinWord 5" | application/msword (doc)
Microsoft Word 6.0 | "MS WinWord 6.0" | application/msword (doc)
Microsoft Word 95 | "MS Word 95" | application/msword (doc)
Microsoft Word 95 Template | "MS Word 95 Vorlage" | application/msword (dot)
Word 97–2003 | "MS Word 97" | application/msword (doc wps)
Word 97–2003 Template | "MS Word 97 Vorlage" | application/msword (dot wpt)
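The command-line conversion mentioned above can be driven from .NET Core by starting the soffice executable in headless mode. This is only a sketch: the LibreOfficeConvert class and its method names are hypothetical, the path to soffice is an assumption about the server, and ProcessStartInfo.ArgumentList needs .NET Core 2.1 or later. The switches used (--headless, --convert-to, --outdir) are standard LibreOffice command-line parameters.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical wrapper around the LibreOffice command line.
public static class LibreOfficeConvert
{
    // Builds: --headless --convert-to pdf --outdir <outputDir> <inputPath>
    public static List<string> BuildArguments(string inputPath, string outputDir)
    {
        return new List<string>
        {
            "--headless", "--convert-to", "pdf", "--outdir", outputDir, inputPath
        };
    }

    // Runs soffice and returns its exit code; sofficePath is installation specific.
    public static int ConvertToPdf(string sofficePath, string inputPath, string outputDir)
    {
        var psi = new ProcessStartInfo { FileName = sofficePath, UseShellExecute = false };
        foreach (var argument in BuildArguments(inputPath, outputDir))
            psi.ArgumentList.Add(argument);

        using (var process = Process.Start(psi))
        {
            if (process == null) throw new InvalidOperationException("soffice did not start");
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

LibreOffice writes the PDF into outputDir under the input file's base name, so the caller can pick it up from there once the process exits.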

Generating a .docx from a .dotx using merge (SimpleField) fields

So, first off, here's my code to open the dotx and create a new docx copy (the copy is then modified). It is cut for brevity, but essentially it takes 3 params: a data table (to make it usable by legacy systems), the UNC path to a template as a string, and the UNC path to the output document as a string:
    using (WordprocessingDocument docGenerated = WordprocessingDocument.Open(outputPath, true))
    {
        docGenerated.ChangeDocumentType(WordprocessingDocumentType.Document);
        foreach (SimpleField field in docGenerated.MainDocumentPart.Document.Descendants<SimpleField>())
        {
            string mergeFieldName = GetFieldName(field).Trim();
            DataRow[] dr = dtSchema.Select("FieldName = '" + mergeFieldName + "'");
            if (dr.Length > 0)
            {
                string runProperties = string.Empty;
                foreach (RunProperties property in field.Descendants<RunProperties>())
                {
                    runProperties = property.OuterXml;
                    break;
                }
                Run run = new Run();
                run.Append(new RunProperties(runProperties));
                run.Append(new Text(dr[0]["FieldDataValue"].ToString()));
                field.Parent.ReplaceChild<SimpleField>(run, field);
            }
        }
        docGenerated.MainDocumentPart.Document.Save();
    }
What I did initially was take a .dot template and re-save it as a .dotx and crossed my fingers, didn't work. So instead I tried deleting all merge fields in the .dotx and adding them again. This worked - but it would only find one merge field (as a SimpleField), specifically the last one added before saving the .dotx. Looking further at the template using the open XML productivity tool I can see that all other merge fields are of type w:instrText which is why they're being ignored.
I'm literally just starting out with OpenXML as we're looking to replace our current office automation with it so I know very little at this point. Could someone please instruct me a bit further or point me to a good resource? I've Google'd around a bit but I can't find my specific problem. I am trying to put off reading through the whole SDK documentation (I know, I know!) as I need to get a solution put together quickly so am focusing on a single task which is to take our existing .dot templates, convert them to .dotx and just replace merge fields with data to derive a .docx.
Thanks in advance!

Working with OpenXML, you don't strictly need to use .dotx for your templates; you can create your templates as .docx straight away. A good resource for learning OpenXML is http://openxmldeveloper.org/, and you can find a good PDF read there.
Also worth a look is the third party API at docx.codeplex.com, which I am now using to develop a server-side doc automation solution for my company. See the example at http://cathalscorner.blogspot.co.uk/2009/08/docx-v1007-released.html, which is similar to your scenario: merging fields with data.
Hope this helps.

Here is the link to "C# OpenXML Mail Merge Complete Example" http://www.jarredcapellman.com/2012/10/22/c-openxml-mail-merge-complete-example/
I used it to update my web application code that used to work with mergefields in .DOC (.DOT) files (and required MS Word to be installed and the corresponding DCOM to be configured on the web server). Now the solution requires just OpenXML SDK 2.0 to be installed on the web server and the .DOC (.DOT) templates to be saved as .DOCX (.DOTX).
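The question notes that most merge fields show up as w:instrText rather than as SimpleField. In WordprocessingML, a simple field is a w:fldSimple element with a w:instr attribute, while a complex field keeps the same instruction text in w:instrText runs. The sketch below lists merge field names from both forms using only System.Xml.Linq on the main document XML. The MergeFieldScanner class is hypothetical, and the sketch assumes each instruction sits in a single w:instrText element, although Word may split one instruction across several runs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical scanner; works on word/document.xml loaded as an XDocument.
public static class MergeFieldScanner
{
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the name from e.g. " MERGEFIELD  FirstName  \* MERGEFORMAT ", or "" if not a merge field.
    public static string ParseFieldName(string instruction)
    {
        var match = Regex.Match(instruction, @"^\s*MERGEFIELD\s+(?:""([^""]+)""|(\S+))", RegexOptions.IgnoreCase);
        if (!match.Success) return "";
        return match.Groups[1].Success ? match.Groups[1].Value : match.Groups[2].Value;
    }

    // Simple fields are listed first, then complex fields; this is not document order.
    public static List<string> FindMergeFieldNames(XDocument documentXml)
    {
        var simple = documentXml.Descendants(W + "fldSimple")
                                .Select(e => e.Attribute(W + "instr")?.Value ?? "");
        var complex = documentXml.Descendants(W + "instrText")
                                 .Select(e => e.Value);
        return simple.Concat(complex)
                     .Select(ParseFieldName)
                     .Where(name => name.Length > 0)
                     .ToList();
    }
}
```

Replacing a complex field is more involved than replacing a SimpleField, because everything between the surrounding w:fldChar begin and end markers has to be swapped out, not only the instruction run.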

If you want to use content controls, you can also try WordDocumentGenerator. WordDocumentGenerator is a utility for generating Word documents from templates using Visual Studio 2010 and the Open XML 2.0 SDK: http://worddocgenerator.codeplex.com

Replacing contents inside docx and pdf file using asp.net c#

In my application I am using some templates in docx and pdf format. I store these docs in the DB as bytes.
Before showing or sending these docs back to the user or application, I need to replace some contents inside the doc. For example, if the doc contains ##username##, I need to replace it with the actual username of the customer. I have not found a proper solution for this. Any good ideas?

For the docx file, your best bet is to use OpenXML, and instead of having special text like ##username##, replace it with a content control that you can fill in.
Since you specified docx, you can use OpenXML, which is great, it's an API. If it has to work with older doc files, then you'll have to automate Word (which should be avoided if at all possible).
For the PDF, your best bet is to create a PDF form and fill it in at runtime (using a tool like itextsharp).
HTH,
Brian

For DOC / DOCX:
You should use the MS Word object model through an MS Word assembly reference (this works only on machines that have MS Office installed; otherwise you can use something like the ASPOSE word libraries, which don't need MS Office installed on the server). You can programmatically trigger Word's Find-Replace through the library's API.
For PDF: You will need a third party library for editing pdf files. 3rd party libraries like ABCpdf are available (I'm not sure whether Adobe itself has something for this).
The mechanism is the same as for the Word library, but I am not sure whether you will be able to trigger Find-Replace here or will have to do something else. I have not used a pdf generation library.
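For the docx case, keeping the ##username## style placeholders is also possible without Word on the server, because a .docx file is a zip package whose main text lives in word/document.xml. The sketch below uses only System.IO.Compression and System.Xml.Linq. The DocxTokens class is hypothetical, and the sketch assumes the whole token sits inside a single w:t element. Word often splits typed text across several runs, in which case the token will not be found and a content control is the more robust choice.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: replaces a placeholder in the main document part of a .docx.
public static class DocxTokens
{
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    const string MainPart = "word/document.xml";

    public static byte[] ReplaceToken(byte[] docx, string token, string value)
    {
        using (var ms = new MemoryStream())
        {
            ms.Write(docx, 0, docx.Length);
            using (var zip = new ZipArchive(ms, ZipArchiveMode.Update, leaveOpen: true))
            {
                var entry = zip.GetEntry(MainPart)
                            ?? throw new InvalidDataException(MainPart + " not found");

                XDocument xml;
                using (var input = entry.Open())
                    xml = XDocument.Load(input, LoadOptions.PreserveWhitespace);

                // Materialize the list first so the tree is not modified while it is enumerated.
                foreach (var text in xml.Descendants(W + "t").ToList())
                    if (text.Value.Contains(token))
                        text.Value = text.Value.Replace(token, value); // XLinq escapes the value on save

                // Rewrite the part: drop the old entry and store the updated XML under the same name.
                entry.Delete();
                using (var output = zip.CreateEntry(MainPart).Open())
                    xml.Save(output, SaveOptions.DisableFormatting);
            }
            // The archive is flushed to ms when it is disposed above.
            return ms.ToArray();
        }
    }
}
```

The bytes read from the DB go in, and the returned bytes can be sent to the user as they are.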

How to find out what "kind" of document is being displayed in IE

I am using C# to create a button for IE, and this button performs certain actions that all depend on the document being a PDF document. I am trying to set up a guard to prevent any action from taking place if the document type is not a PDF, but I am not sure how, as IE hands the document over to Adobe and Reader takes charge. I am using SHDocVw and have looked at the WebBrowserClass objects, but I am not sure how to figure this out. Any suggestions?

It's a little bit problematic to do this AFAIK.
The value of the IWebBrowser2::Type property depends on which plug-in you have installed to handle PDFs. Some plug-ins (like Adobe) create an HTML wrapper for the PDF file, so you get "HTML Document" as the type, and some plug-ins (like Foxit) don't, so you can't rely on this exclusively.
So if you get a PDF with an HTML wrapper, you can use IHTMLDocument2::mimeType to find out the exact type of the document (JPEG/GIF/PNG/etc. files are all wrapped in HTML by the browser). But as far as I know it is unreliable too; for instance, on my machine it returns "Firefox Document" for HTML documents because .html files are associated with Firefox :s I didn't test whether this is also the case with PDFs.
Another option is to use the GetUrlCacheEntryInfoEx API call to locate the file in the local browser cache that stores the document, then read it (only the beginning of the file; I think only the first 256 bytes are important) and call FindMimeFromData with the data you just read. It will return the MIME type.
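Once the cached file has been located, a simpler check can stand in for FindMimeFromData when PDF is the only type of interest: PDF files carry the marker %PDF- near the start. The sketch below swaps in that signature check and does not call urlmon at all. The PdfSniffer class is hypothetical, and scanning the first 1024 bytes rather than only offset 0 is a deliberately lenient assumption, since some viewers tolerate leading bytes before the header.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: signature check used in place of FindMimeFromData.
public static class PdfSniffer
{
    static readonly byte[] Magic = Encoding.ASCII.GetBytes("%PDF-");

    // True if "%PDF-" occurs within the first 1024 bytes of the buffer.
    public static bool LooksLikePdf(byte[] head)
    {
        int lastStart = Math.Min(head.Length, 1024) - Magic.Length;
        for (int i = 0; i <= lastStart; i++)
        {
            int j = 0;
            while (j < Magic.Length && head[i + j] == Magic[j]) j++;
            if (j == Magic.Length) return true;
        }
        return false;
    }

    // Reads the beginning of a file, such as the browser cache entry, and checks it.
    public static bool FileLooksLikePdf(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[1024];
            int read = stream.Read(buffer, 0, buffer.Length); // one read is normally enough for a local file
            Array.Resize(ref buffer, read);
            return LooksLikePdf(buffer);
        }
    }
}
```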

Check mime type of the document or see the window.location.href of webbrowser... If pdf is being displayed, you would be able to find it...

Another good way is the following:
1] Cast the Document object to IPersist and then extract the CLSID using .GetClassID(..).
2] P/Invoke ProgIDFromCLSID to extract the ProgID.
3] Match the ProgID against known COM objects / applications.

.NET server based PDF generation

I'd like to dynamically generate content and then render it to a PDF file. This processing would take place on a remote hosting server, so using virtual printers etc. is out. Does anyone have a recommendation for a .NET library (preferably C#) that would work?
I know that I could generate a bunch of PS code and package it myself but I'd prefer something a little less tricksy at this stage.
Thanks!

Have a look at http://itextsharp.sourceforge.net/. It's open source.
Tutorial: http://itextdocs.lowagie.com/tutorial/

I have had good success using SharpPDF.

I have had success using Siberix
http://www.siberix.com/
Corporate License: $350 USD (A single license covers unlimited number of company's developer seats, unlimited number of company's web servers and unlimited number of distributions as a part of your application.)

Free PDF Generator .NET (WkHtmlToPdf wrapper) can generate pretty PDF from HTML template with one line of code:
var pdfBytes = (new NReco.PdfGenerator.HtmlToPdfConverter()).GeneratePdf(htmlContent);
(all you need is one DLL, no external dependencies)

We use the Amyuni PDF Converter and have used it successfully for several years. Our usage is via the COM interface, but it does support a .NET interface.

I've had good experiences with Winnovative's HTML to PDF.
And bad ones with Open Source HTML Doc (Problems with form elements + CSS).

I have been looking for a high performing docx to pdf tool for a while now. Our system has an e-government aspect and is generating a very high number of reports to the user community. At this point, performance is paramount.
Earlier tools I have used did not do simultaneous conversion, instead each exe needed to wait for completion of the other. I have tried Aspose.words and I am very happy with the results.
First of all, it was very easy and seamless to integrate and deploy in our project. Very smooth.
Secondly, the speed of conversion is way better due to the fact that multiple jobs run in parallel.
Thirdly, not only fast, but even with no formatting errors. Considering that we are providing a multi-lingual system and some reports include both English and Arabic fields (mind right-to-left alignment!), this was very important.
And finally, the file size was quite small, which again is very important as tens of thousands of documents are created through our system.
Our first implementation used the Microsoft Office Interop library. We converted docx documents to pdf using the code below. This library converts docx documents to pdf files perfectly, and we decided to deploy it to our report generation server. But after a while, we noticed that each conversion had to wait for the previous executable to finish. This caused a big delay when converting documents at the same time, which is why we started to search for a new tool for converting docx files to pdf files.
See Image
The code below shows how to convert docx documents to pdf files using the Aspose.Words for .NET tool.
See Image 2

RDLC & the Report Viewer controls can generate PDF either at the Client's discretion or at server command which can then be served as a PDF mime-type.

I've used PDF4NET from O2solutions with much success. They support all sorts of scenarios and digital signing of the pdf.

If your data is mostly in XML, you could also look at a XSL-FO solution - we're using Alt-Soft's Xml2Pdf with great success. The "server" version is a bit of a misnomer - it's really just a single DLL you need to include in your Winforms, WPF or ASP.NET app - that's all!
Works like a charm (if you're familiar with XSLT and XSL-FO, or willing to learn it).
Marc

We used a set of third party DLLs from PDFSharp who in turn use DLLs from MigraDoc. I'm not privy to all the reasons that we went that direction (the decision was made by a senior developer), but I can tell you that:
It seems to be in active development.
It had most of the features we needed.
The source code is available. Although it used some patterns and conventions that I hadn't seen before, once I got on to them, it was fairly easy to make the changes. I added support for using the System.Drawing.Image directly rather than as saving files.
It is not documented well either internally or externally.

I used iTextSharp in .NET 6 as shown below, but it had trouble loading scripts and CDN-hosted stylesheets; it only works with inline styles. The resulting bytes can be saved using File.WriteAllBytes().
    using System.IO;
    using iTextSharp.text;
    using iTextSharp.text.html.simpleparser;
    using iTextSharp.text.pdf;
    using PageSize = iTextSharp.text.PageSize;

    public static byte[] GenratePdfBytes(string htmlContent)
    {
        // Margins are given in points: left, right, top, bottom.
        var pdfDoc = new Document(PageSize.A4, 10f, 10f, 10f, 0f);
        // HTMLWorker handles only simple HTML with inline styles.
        var htmlparser = new HTMLWorker(pdfDoc);
        using (var html = new StringReader(htmlContent))
        using (var memoryStream = new MemoryStream())
        {
            // Bind the document to the in-memory stream before opening it.
            PdfWriter.GetInstance(pdfDoc, memoryStream);
            pdfDoc.Open();
            htmlparser.Parse(html);
            pdfDoc.Close();
            // ToArray still works after the document has closed the stream.
            return memoryStream.ToArray();
        }
    }

There are a few ways to do this, in my experience. The right one depends on the application, the complexity of what you are trying to generate, whether the resulting PDF needs to be a commercial print-ready file or just a PDF report for sharing/archiving etc., the volume of output, and your budget. Most higher-end PDF libraries come with a large price tag.
I have used various techniques based on the complexity. There are libraries that generate PDF by building PDF elements from the ground up; in this case you could use something like iText, or others that can add content on top of a PDF.
If you only need minor adjustments, i.e. you use an existing PDF as a template and add some content (text/images), there are libraries that can just stamp text and images on top (e.g. http://www.pdfsharp.net/).
If you are generating invoices or reports, you could use an HTML template, merge in the data (replacing {tokens} etc.) and then convert the HTML to PDF using a different type of mechanism (e.g. https://www.nrecosite.com/pdf_generator_net.aspx).
If you need full control over styling, client-generated templates (idml) etc., there are APIs: you can integrate with InDesign Server and use it to generate print-ready PDF files. I have built an API like this, but that is another level of PDF generation.
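The merge step of the HTML template approach can be sketched as follows. The HtmlTemplate class is hypothetical, and the {name} token syntax is an assumption based on the {tokens} mentioned above. Values are HTML-encoded so that data cannot break the markup, and unknown tokens are left untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: fills {token} placeholders in an HTML template.
public static class HtmlTemplate
{
    static readonly Regex Token = new Regex(@"\{(\w+)\}");

    public static string Merge(string template, IDictionary<string, string> values)
    {
        return Token.Replace(template, match =>
            values.TryGetValue(match.Groups[1].Value, out var value)
                ? WebUtility.HtmlEncode(value) // encode data before it enters the markup
                : match.Value);                // leave unknown tokens as they are
    }
}
```

The merged string can then be handed to whichever HTML to PDF converter is in use, for example the GeneratePdf(htmlContent) call shown in the WkHtmlToPdf wrapper answer above.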
