Best way to write a word document in C# (Bi-Directional)

Best way to write a word document in C# (Bi-Directional) - c#

I've been struggling in the last few days, trying to write a Word document.
I've tried downloading DocX by (Novacode) which was not a big success, then moved to Microsoft.Office.Interop.Word library which was better but still, not a huge success.
The problem is that I'm trying to write a Right-To-Left document, which is of course mixed with different punctuation. The moment I add punctuation the entire line is reversed.
I get many lines written from Database, write them the way they are in the document, and I can not manipulate them, unlike titles and stuff, which I can manipulate, reverse stuff and get the lines the way I want, after struggling.
I've seen some answers saying I should use a specific char which 'tells' the reading algorithm it is about to face a Right-To-Left line, but here most data is derived from database.
Has anyone faced that kind of problem and can give some advices?

To whoever finds it relevant, and none of the above helped, I found this answer the best for Right-To-Left documents:
oDoc.Paragraphs.ReadingOrder = Word.WdReadingOrder.wdReadingOrderRtl;

Did you try with the Open XML SDK and using the BiDi class?
http://msdn.microsoft.com/en-us/library/dd452407(v=office.12).aspx

Related

How to extract text from a Word document?

I am trying to extract specific text from a Word document based on coordinates. I have searched many sites on this requirement, but with no luck. How to set the coordinates to the Word document text?

The accepted answer in this question has a solution for finding the text of a certain line number in a Word doc.
Obviously you'll need a bit of extra code to search the strLine variable for a specific substring or whatever, but the hard work is done there I think.

Depending on the format of the Word file, there are two object models. The older style .doc files used one that had paragraphs, tables, and the like. The .docx files have an XML based structure that's a totally different model.
If you need to support both formats, you've got your work cut out for you.
Here's a link to the documentation:
Word Object Model

Excel Formula to Pseudocode

Is there any tool out there which will format excel formulas in such a way that they are more easily decipherable?
I need to convert some complex excel spreadsheets people have made to C# applications and sitting there looking at one line excel formulas is relatively troublesome. Primarily I'm looking for something that can rewrite them to pseudocode or a more readable programming language.
The closest thing I could find was http://ewbi.blogs.com/develops/2004/12/excel_formula_p.html but this still does not help all that much.

Sean Cheshire's comment is probably the best answer you're going to get:
Excel Formula Beautifier
It has a feature that allows you to convert to javascript.
The one thing it seems to be missing is a feature to do a batch of formulas at once (which would save a lot of time if converting a whole sheet with "Show Formulas" turned on). I added a request for this.
Given how the app works now, if you try pasting a batch into the single-line input field, you may still get a result that's at least somewhat useful. I recommend giving it a try.

word document, how to edit the right way in asp.net application

I have asp.net app in which I need to edit a word document and than send that document in email as an attachment.
I would like to know what will be the best way to edit the word document and than use it.
The document already has data and there are few variables such as "company name", "date", "amount", etc that I am searching in the document and I am replacing them with values from within the code.
The code works great when I am running it locally but from some people I received answers that editing word document on the server shouldn't be the way I am doing now but I need to use either openxml to edit the document or google docs.
Any idea what's the best way to tackle this?

I would vote for OpenXML, but be prepared to spend a good day or two reading how to use the API for .NET and be patient. =)
I remember using this tool -
http://openxmldeveloper.org/resources/dotnet/m/cc/303.aspx - quite a bit to find the relevant parts in the document to modify. You basically load a Word document and can "drilldown" to find the parts you want to modify. You can actually write some pretty clean code to search the document for your textual markers and then replace them with data.
(I hope I understood the question correctly. You said you already had working code, so I wasn't sure what the question was.)

You can use the Open XML Format SDK as per http://msdn.microsoft.com/en-us/library/dd440953%28v=office.12%29.aspx
For what you're doing though, I think your approach is fine.
I have a fair amount of screwdrivers, but if I noticed a lose screw in the stool in front of me, I might just use the knife on the table because it will do the job perfectly adequately and save me a trip to the toolbox. It's not a tool designed for that job, but it is a tool that would do the job just as well and with less effort.
Now, if I decided to set about a day's worth of DIY with only the knife instead of the set of screw-drivers, that would be going to the other extreme. Here I'd have long-ago crossed the line where using the tools designed for a given job would have made my life much easier.
It's just the same with software tools.
One of the very points of XML formats is that we can do simple tasks with it treating it just as text. Yeah, we none of us want to be the guy with a 3-page-long regular expression with which they're trying to parse a complicated XML document, but when the problem naturally breaks down to a simple text substitution, do a simple text substitution.

How do I explore a PDF to determine if an element is text?

I have a PDF and want to extract the text contained in it. I've tried a few different PDF libraries and they all return basically the same results. When extracting the text from a two page document with literally hundreds of words, only a dozen or so words from the header are returned.
Is there any way to tell if the text I'm after is actually text or a raster image of the text? I'm thinking something along the lines of Firebug's "Inspect Element" but at this point I'll take any solution that tells what I'm really looking at.
This project really doesn't justify attempting to use OCR. And, although a simple solution, using fields in the PDF is not an option since the generator of the file is a third party.

If Acrobat/Reader can select the text, then it Is Text.
Reasons your library might not be able to find the text in question:
Complex/bad fonts or encodings. Adobe can be very forgiving of garbage in, somehow managing to get Good Info out.
The text could be in an annotation rather than the page contents. It won't matter what program parses the content stream if you need to look in the annot array instead.
You didn't name a particular library, so it's possible that the library you're using doesn't look inside XObject Forms. That's unlikely in an even remotely mature API, but stranger things have happened.
If you can get away with copy/pasta from Reader, then just go that route.

Have you tried Amyuni PDF Creator .Net? It allows you to enumerate all components from a specified rectangular region of a page and inspect their type from a predefined types list. You could run a quick test using the trial version and the following code sample for text extraction:
// open a PDF file
axPDFCreactiveX1.Open(System.IO.Directory.GetCurrentDirectory()+"\\sampleBookmarks.pdf", "");
axPDFCreactiveX1.Refresh ();
String text = axPDFCreactiveX1.GetRawPageText (1);
MessageBox.Show (text);
Additionally, it provides Tesseract OCR integration in case you needed it.
Disclaimer: I am part of the development team of this product.

Check this site out. It may contain some helpful code snippets. http://www.codeproject.com/KB/cs/PDFToText.aspx

Batch conversion of docx to clean HTML

I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.
I think it'd benefit to explain what that entails. I work for database group at my university's IT department. My main job is to take specs of a report in a docx file, copy that over to dreamweaver, fix some formatting, and put it onto their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header, and footer from the webpage on there, and save the result. I originally planned to have it do one by one, but it probably wouldn't be difficult to have it input a list of files and batch convert.
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started using. XSLT had the benefit of not needing word to be installed nor ran for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs the document.xml from that, and uses the DocX2Html.xsl file I scavenged from OpenXML viewer. I believe that was originally provided by MS for sharepoint servers to provide the ability to render word documents in a browser. Or something along those lines.
After adjusting that code to fit my needs, and having issues with the objXSLT.Load () method, I ended up using IlMerge to make the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the result file has ridiculously ugly HTML syntax. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue with that is that I don't know anything about how to do that. Perhaps there's an alternative version already out there. All I'd need is one that will preserve tables and text formatting. Images aren't needed.

This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml

Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:
Open the Word document with Aspose.Words.
Save the Word document as HTML.
Use something like SgmlReader or HTML Agility Pack (or even Regular Expressions if it is suitable) to remove unwanted HTML tags/attributes.
Since you wrote you work at an university, I'm not sure whether commercial packages are an option, though.

Hi not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one click conversion eg you can right click on a word file and it will be directly converted to html and the code placed into the clipboard. The current version also supports command line access and the new version will have a server version to.
There is a free trial version downloadable from the site , and if you have any questions do contact me any time.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.