I am trying to figure out how excel files are written, so I can read and edit them in C# without a library (because I like to make work for myself like that). When opening them in notepad, all you can see is strange wingding characters, so immediately I thought, I might get some results when reading in a byte array. No luck, I do get a sensibly sized byte array, but converting it to a string ends up with a useless result!
So I guess my question is, how are excel files written, and how can I read them without a library?
A good place to start would be this project : Excel Data Reader CodePlex
It has working code, and you'll be able to learn the older format "xls". "XLSX" is really more the new OpenXML document standard. A link to an SDK from Microsoft for that:
Open XML for Office Developers (Microsoft)
I wanted to learn the formatting myself for some mono work I've been doing.. the older file format "XLS" isn't exactly something that you'd "learn". Pretty complicated if you ask me. Ended up just waiting for a mono product to appear rather than buying source from a number of vendors.. and porting to mono.
Related
I had this stupid idea of creating a template as a .docx or .rtf or .pdf and then replacing the text in that document to generate reports. This seemed like a better way of doing it than using paid reporting software.
Well, I believe I've tried just about everything now and I'm amazed at how impossible it is to do anything with pdfs.
Try 1
HTML -> PDF
A lot harder to design the template. It doesn't look the same when you print it. Never got it working outside of a command line example (not sure how well, say, iTextSharp-LGPL would even work or if it could handle base64 strings as I'm not sure how else you are going to tell it about images). In any case, doing it this way makes it too hard to design the template.
Try 2
OpenXml -> PDF
I stupidly assumed that because Word could save as PDF that OpenXml could to. I was wrong. It cannot save as a PDF.
Try 3
OpenOffice/LibreOffice (docX -> PDF)
It can't read OpenXml which is a problem because I was editing the template as OpenXml and then saving that result (as a .docx) but it can't read that saved document.
Try 4
iTextSharp LGPL
This one just doesn't work, lol. And apparently even though when you google "convert rtf to pdf" the ONLY thing that comes up is iText and its derivatives it doesn't convert rtf documents to pdf documents. I verified this myself (it only saves the text not the formatting) and later found this post to convince me I wasn't doing something wrong.
Try 5
PDF -> PDF
Since converting ANYTHING to a PDF seems to be impossible maybe I can save the template as a PDF and just do a text replace on that. Nope, lol, that is apparently a very difficult thing to do.
Try 6
Pandoc (.odt/.docx -> pdf), (.rtf -> .pdf not supported)
pandoc mockup2.odt -s -o mockup2.pdf
link to the files in the picture. *note, it messes up in the same way if you try converting .odt/.docx to .tex.
What do I do here? Buy software so that I can save a file as PDF? Is that the only option?
I have a solution. I'm not saying it's the best solution. LibreOffice (or possibly OpenOffice if you are so inclined) accepts command line arguments that will do the switch.
soffice.exe --headless --convert-to pdf mockup.odt
*note - this is after I added libreoffice to my path (C:\Program Files\LibreOffice\program). idk why it's called soffice.exe instead of libreoffice.exe.
where i found the answer
relevant documentation
I might have a working solution for you, if you are stuck with the docx-file for the template.
I found one free solution for docx to pdf conversions, without using microsoft.interop, etc.: See first answer in this stack overflow post
It uses two tools: The open xml power tools and DinkToPdf (Which is essentially a wkhtmltopdf wrapper). The html to pdf part works just fine, but the docx to html part looks like a catastrophe at first. You can fix this with custom css (There are some resources online).
Powertools-.NetStandard
DinkToPdf-GitHub
There are more possibilities for proprietary software, like Asposes.Words and Syncfusion file-formats. Most of the proprietary solutions are pretty expensive...
If you are just working on a Windows Environment, where MS-Office is installed, you can use Microsoft.Interop. It is by far the easiest solution (In this post, Interop is mentioned several times Stackoverflow Word to PDF
If you found another (better) working solution, please let me know. I still have not decided if I will use a proprietary or a free solution. :-)
I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.
I think it'd benefit to explain what that entails. I work for database group at my university's IT department. My main job is to take specs of a report in a docx file, copy that over to dreamweaver, fix some formatting, and put it onto their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header, and footer from the webpage on there, and save the result. I originally planned to have it do one by one, but it probably wouldn't be difficult to have it input a list of files and batch convert.
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started using. XSLT had the benefit of not needing word to be installed nor ran for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs the document.xml from that, and uses the DocX2Html.xsl file I scavenged from OpenXML viewer. I believe that was originally provided by MS for sharepoint servers to provide the ability to render word documents in a browser. Or something along those lines.
After adjusting that code to fit my needs, and having issues with the objXSLT.Load () method, I ended up using IlMerge to make the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the result file has ridiculously ugly HTML syntax. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue with that is that I don't know anything about how to do that. Perhaps there's an alternative version already out there. All I'd need is one that will preserve tables and text formatting. Images aren't needed.
This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml
Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:
Open the Word document with Aspose.Words.
Save the Word document as HTML.
Use something like SgmlReader or HTML Agility Pack (or even Regular Expressions if it is suitable) to remove unwanted HTML tags/attributes.
Since you wrote you work at an university, I'm not sure whether commercial packages are an option, though.
Hi not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one click conversion eg you can right click on a word file and it will be directly converted to html and the code placed into the clipboard. The current version also supports command line access and the new version will have a server version to.
There is a free trial version downloadable from the site , and if you have any questions do contact me any time.
We store a word document in an Oracle 10g database as a BLOB object. I want to read the contents (the text) of this word document, make some changes, and write the text alone to a different field in a C# code.
How do I do this in C# 2.0?
The easiest logic that I came up with is this -
Read the BLOB object
Store it in the FileSystem
Extract the text contents
Do your job
Write the text into a separate field.
I can use Word.dll but not any commercial solutions such as Aspose
I assume that you already know how to do steps 1 and 2 (use the Oracle.DataAccess and System.IO namespaces).
For step 3 and 5, use Word Automation. This MS support article shows you how to get started: How to automate Microsoft Word to create a new document by using Visual C#
If you know what version of Word it will be, then I'd suggest using early binding, otherwise use late binding. More details and sample code here: Using early binding and late binding in Automation
Edit: If you don't know how to use BLOBs from C#, take a look here: How to: Read and Write BLOB Data to a Database Table Through an Anonymous PL/SQL Block
This keeps coming up in my searches, so I'll add an answer for the benefit of future readers.
I highly recommend avoiding Word automation. It's painfully slow and subjects you to the whims of Microsoft's developers with each upgrade. Instead, process the files manually yourselves if you can. The files are nothing but zipped archives of XML files and resources (such as images embedded in the document).
In this case, you'd simply unzip the docx using your preferred library, manipulate the XML, and then zip the result back up.
This does require the use of docx files rather than doc files, but as the link above explains, this has been the default Word format since Office 2007 and shouldn't present an issue unless your users are desperately clinging to the past.
For an example of the time savings, Back in 2007 we converted one process that took 45 minutes using Word automation and, on the same hardware, it took 15 SECONDS processing the files manually. To be clear, I'm not blaming Microsoft for this - their Word automation methods don't know how you will manipulate the document, so they have to anticipate and track everything that you could possibly change. You, on the other hand, can write your method with laser focus because you know exactly what you want to do.
The release notes of a software have some important data that I would like to extract in every release. Is there a way to extract certain information from Microsoft Word?
The application that I am thinking of would be written in C#, but I am okay if it is any other solution.
All MS Office products (Word, Office, etc.) are totally scriptable, both internally (using VBA) and externally (via OLE Automation, also known as ActiveX; in fact, VBA uses the interface exposed through OLE).
My suggestion would be to look for a library in your language that supports this. Here is a link to a Perl module, Win32::OLE, that does: as you can see, it's quite easy to use and very powerful. The interface should be similar for other languages.
I went through this a few years back. You can:
Use Word to convert the file into some other format, ASCII, RTF, XML etc.
Use some third-party app to convert to another format, such as ASCII.
Access the Word API through OLE and extract the information directly.
I couldn't find any generic libraries to read Word files, and back then all of the applications that read Word files only worked for a subset. Word changed often enough that they had trouble keeping up.
There were some documents that listed the specifics of the older Word file formats, the underlying file structure is outrageously complicated. Without a lot of resources it would be hard to keep code in sync with the file format.
Initially, I used Perl to drive Word and create new documents, but the solution was too fragile. Later I switch the whole application to work with PDFs instead, and gave up on Word.
Paul.
Probably not the most elegant solution but this seems to be the lightest method: Use a Cscript.
Just tried it on a sample word doc(2003) and it works perfectly.
More information: http://www.gregthatcher.com/Papers/VBScript/WordExtractScript.aspx
I did a lot of excel programming with the VSTO (Visual Studio Tools for Office) tools, I think you will be able to use the VSTO API to read a word doc. You should be able to use C#
You could write an IFilter to extract text from word files. No need to have Word installed.
You can work from within Word (VBA, VSTO) or outside it.
From outside it, automation is one approach.
Another is to avoid using Word entirely. If the docs are .docx, you can use anything which can manipulate an Open XML file. Microsoft has its Open XML SDK, and in the Java world you can use docx4j or POI.
I`d like to be able to read the content of office documents (for a custom crawler).
The office version that need to be readable are from 2000 to 2007. I mainly want to be crawling words, excel and powerpoint documents.
I don`t want to retrieve the formatting, only the text in it.
The crawler is based on lucene.NET if that can be of some help and is in c#.
I already used iTextSharp for parsing PDF
If you're already using Lucene.NET you might just want to take advantage of the various IFilters already available for doing this. Take a look at the open source SeekAFile project. It will show you how to use an IFilter to open and extract this information from any filetype where an IFilter is available. There are IFilters for Word, Excel, Powerpoint, PDf, and most of the other common document types.
There is an excelent open source project POI, only drawback - it is written for Java.
The .net port is somehow very beta.
Here is a good list of various tools for converting Word documents to plaintext, which you can then do whatever with.
Here's a nice little post on c-charpcorner by Krishnan LN that gives basic code to grab the text from a Word document using the Word Primary Interop assemblies.
Basically, you get the "WholeStory" property out of the Word document, paste it to the clipboard, then pull it from the clipboard while converting it to text format. The clipboard step is presumably done to strip out formatting.
For PowerPoint, you do a similar thing, but you need to loop through the slides, then for each slide loop through the shapes, and grab the "TextFrame.TextRange.Text" property in each shape.
For Excel, since Excel can be an OleDb data source, it's easiest to use ADO.NET. Here's a good post by Laurent Bugnion that walks through this technique.
You might also consider checking out DtSearch (www.DtSearch.com). Although it is primarily a searching tool, it does a great job of extracting text from a large number of file types and is considerably cheaper than other options like the Oracle/Stellent OutsideIn technology or the equivalent from Autonomy.
I've been using DtSearch for years and find it indispensible for this type of task.