I have an old Visual Basic application that can upload and download documents as OLE objects. They are saved to a SQL Server database, which stores them in a varbinary(max) field. However, those binaries don't have the same format as the original files, because OLE wraps them in its own structure.
I want to mass-download all those documents with a .NET C# app that talks to SQL Server, but I can't find a way to do it. I have tried copying the binary data into new files using SqlDataReader, MemoryStream and FileStream, but they write the bytes out as-is rather than unwrapping the OLE structure, so the resulting files are corrupted.
Is there a class that can properly interpret these OLE binaries? The old app used an OLE Container component, but that hasn't existed for a few years now.
This is an old post that never got answered. I am in a similar situation myself, in that I need to take images and convert them into OLE files so that Microsoft Access OLE Bound Frames can read them. I have been researching this for days and have just now identified the binary format involved, i.e. the Compound File (structured storage) format. There is a NuGet package available that should accomplish what we both need: OpenMCDF.
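For the reading direction (the original question of getting the document back out of the blob), a sketch along these lines with OpenMCDF might be a starting point. This is only an assumption-laden sketch of mine, not something from the package docs: the helper names and output handling are made up, and the blob often carries an extra OLE wrapper before the compound-file signature (D0 CF 11 E0), which is why the code scans for it first.

using System;
using System.IO;
using System.Linq;
using OpenMcdf;

// Sketch only: treat the varbinary blob as an OLE compound file and dump its streams.
// OLE containers often prepend their own header, so scan for the D0 CF 11 E0 signature first.
static void DumpOleBlob(byte[] blob, string outputFolder)
{
    byte[] signature = { 0xD0, 0xCF, 0x11, 0xE0 };
    int offset = FindSignature(blob, signature);
    if (offset < 0)
        throw new InvalidDataException("No compound-file signature found in blob.");

    MemoryStream ms = new MemoryStream(blob, offset, blob.Length - offset);
    CompoundFile cf = new CompoundFile(ms);
    try
    {
        cf.RootStorage.VisitEntries(item =>
        {
            if (item.IsStream)
            {
                // Stream names can contain control characters, so sanitize them before use as file names.
                string safeName = string.Concat(item.Name.Select(c => char.IsControl(c) ? '_' : c));
                File.WriteAllBytes(Path.Combine(outputFolder, safeName), ((CFStream)item).GetData());
            }
        }, true);
    }
    finally
    {
        cf.Close();
    }
}

static int FindSignature(byte[] data, byte[] pattern)
{
    for (int i = 0; i <= data.Length - pattern.Length; i++)
    {
        int j = 0;
        while (j < pattern.Length && data[i + j] == pattern[j]) j++;
        if (j == pattern.Length) return i;
    }
    return -1;
}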
I am still in the process of trying to figure out how to create a compound file from Raw Binary... but feel I am close and will post the resolution when I get it.
UPDATE: The project I was working on was not supposed to be this involved, so I didn't want to waste any more time on it after already trying different recommendations for a couple of days. I ended up creating a complete hack that I hope never to have to repeat. Here are the steps I used:
I set up a form with a maximized window to display each of the images one by one via a timer interval.
Between each interval I took programmatic screen shots (Print Screen) of the images, saving them in a separate folder using an identifiable naming convention.
That portion was done all in VBA.
Once I had the full-screen shots in a folder, I used Photoshop to batch-crop all the images to the exact image sizes.
A hack yes, but it accomplished what I needed... what a pain!
I've been tasked with a feature that can generate PowerPoint files on the server using C#. I'd basically start with a template, and programmatically replace some text with live data from the database. I've been doing some research into this area for the past day and here's what I've found:
PowerPoint has this sort of thing built in, meaning it can connect to external data sources and pull in data. Most examples of this that I've found either use PowerPoint automation done on the server (I've been advised against this) or assume a SQL Server backend. Our company uses Oracle for our RDBMS needs. Oracle has a solution for this called Oracle BI, but it requires a whole new web server setup to run various Java EE components and what not. I didn't look at the price, but knowing Oracle it's not cheap. It also requires new software to be installed on the end user's machine, which we really want to avoid.
Generating PowerPoint files on the fly is possible. The company that is basically the go-to guys for this problem (every help forum points to them, and they get all the rave reviews) is Aspose. They have .NET components for dealing with just about any Office format you can think of. The problem is, they are astronomically expensive. Just the PowerPoint component (a site license for up to 10 developers) would cost $3,995.
The third possibility is generating a solution in-house. After all, a PPTX file is just XML, right? Well, looking closer, a PPTX is actually a zip archive. It contains many folders, each containing many XML files. Modifying a PPTX file would, correct me if I'm wrong, entail unzipping the file to a temporary directory, reading the XML files and modifying their contents, then packaging everything up again and writing the file out to the response stream. Perhaps there are libraries that can work with zip streams on the fly without extracting everything.
My Question: Are there easier ways to work with a PPTX file using .NET that don't require working with compressed XML files or buying very expensive software? Basically, we need to modify a PowerPoint file, change some text, and allow the user to download that generated file from a web server.
OpenXML is Microsoft's .Net library that lets you manipulate Office documents. It lets you open a PPTX file and provides an object model that wraps the XML contents.
Here's the link to the OpenXML SDK and the MSDN documentation.
I've used OpenXML to let an ASP.Net page dynamically generate Word documents from a database.
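A minimal sketch of what that looks like for a PPTX (the file name and placeholder text are hypothetical):

using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using A = DocumentFormat.OpenXml.Drawing;

// Sketch: open a template .pptx and replace a text placeholder on every slide.
static void FillTemplate(string pptxPath, string placeholder, string value)
{
    using (PresentationDocument doc = PresentationDocument.Open(pptxPath, true))
    {
        foreach (SlidePart slidePart in doc.PresentationPart.SlideParts)
        {
            // a:t elements hold the literal text shown on the slide.
            foreach (A.Text text in slidePart.Slide.Descendants<A.Text>())
            {
                if (text.Text.Contains(placeholder))
                    text.Text = text.Text.Replace(placeholder, value);
            }
            slidePart.Slide.Save();
        }
    }
}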
Don't use Office Interop on a web server. It's an all-around bad idea.
If you are only replacing text placeholders for files that will not change, the home-grown solution that finds the placeholders in the XML files inside the zip archive should be doable. .Net has had zip support for some time, and it is greatly improved if you are able to use .Net 4.5, so you shouldn't need to extract the archive to a temporary location at all.
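A minimal sketch of that approach, assuming .Net 4.5's System.IO.Compression (the slide path and placeholder text are hypothetical, and a real version would loop over every slide part):

using System.IO;
using System.IO.Compression;
using System.Text;

// Sketch: replace a placeholder inside one slide of a .pptx without extracting the archive.
static void ReplacePlaceholder(string pptxPath, string placeholder, string value)
{
    using (ZipArchive archive = ZipFile.Open(pptxPath, ZipArchiveMode.Update))
    {
        ZipArchiveEntry entry = archive.GetEntry("ppt/slides/slide1.xml");
        string xml;
        using (StreamReader reader = new StreamReader(entry.Open(), Encoding.UTF8))
        {
            xml = reader.ReadToEnd();
        }

        // Recreate the entry so the new (possibly shorter) content fully replaces the old one.
        entry.Delete();
        ZipArchiveEntry newEntry = archive.CreateEntry("ppt/slides/slide1.xml");
        using (StreamWriter writer = new StreamWriter(newEntry.Open(), Encoding.UTF8))
        {
            writer.Write(xml.Replace(placeholder, value));
        }
    }
}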
PowerPoint should also support connecting directly to Oracle in the same way it supports connecting to SQL Server (just play around with the connection options), without needing the special Oracle BI stuff. However, I'd still prefer the home-grown solution, as a direct connection only works while the PowerPoint file can reach your database, which is typically only possible inside your local LAN or over an active VPN.
If you want anything fancier than a simple text replacement, perhaps look for an Aspose competitor.
I've developed an Excel interop application which generates various reports based on a copied template. The application has to be optimized to avoid useless routines such as regenerating a report that is already up to date.
There are 2 factors that create the need for a very specific solution.
The file may be modified manually in non-automated sections (so I cannot use a file hash or the modified date)
I cannot afford to read inside the Excel sheet for a version number since the goal is to improve processing time by skipping files.
My idea was to use the file properties (the ones shown in Windows' right-click Properties dialog) to store a SQL row version or data hash.
However, so far I haven't found a clean method to achieve this.
So the question is: is there a .NET feature, or a well-supported / recommended / maintained library, to manage Windows file properties? If not, what alternative would you guys suggest?
If the file is only ever going to be stored on an NTFS volume, then you can use an Alternate Data Stream.
There is a library on CodeProject here that lets you use them from a .Net project.
The only things you have to watch for are that ADS don't survive being copied to a non-NTFS volume, and that copy applications that aren't ADS-aware may not copy them.
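For illustration, a minimal sketch of the idea (the stream name is made up; this colon syntax is accepted by the plain File APIs on modern .NET on Windows, while .NET Framework rejects it, which is where the CodeProject library comes in):

using System.IO;

// Sketch: tag a report file with a SQL row version in an alternate data stream.
const string VersionStream = ":SqlRowVersion";

static void WriteVersionTag(string reportPath, string rowVersion)
{
    File.WriteAllText(reportPath + VersionStream, rowVersion);
}

static string ReadVersionTag(string reportPath)
{
    try
    {
        return File.ReadAllText(reportPath + VersionStream);
    }
    catch (FileNotFoundException)
    {
        return null;   // the file has never been tagged
    }
}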
We store a Word document in an Oracle 10g database as a BLOB object. I want to read the contents (the text) of this Word document, make some changes, and write the text alone to a different field, in C# code.
How do I do this in C# 2.0?
The easiest logic that I came up with is this -
Read the BLOB object
Store it in the FileSystem
Extract the text contents
Do your job
Write the text into a separate field.
I can use Word.dll but not any commercial solutions such as Aspose
I assume that you already know how to do steps 1 and 2 (use the Oracle.DataAccess and System.IO namespaces).
For steps 3 and 5, use Word Automation. This MS support article shows you how to get started: How to automate Microsoft Word to create a new document by using Visual C#
If you know what version of Word it will be, then I'd suggest using early binding, otherwise use late binding. More details and sample code here: Using early binding and late binding in Automation
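To give an idea of what the late-bound flavour looks like (this is only a sketch of mine, not code from the linked article; member names follow the Word object model, and COM cleanup is reduced to a bare Quit call):

using System;
using System.Reflection;

// Sketch: late-bound Word automation, no PIA reference needed, C# 2.0 friendly.
static string ReadDocumentText(string docPath)
{
    Type wordType = Type.GetTypeFromProgID("Word.Application");
    object word = Activator.CreateInstance(wordType);
    try
    {
        object documents = wordType.InvokeMember("Documents", BindingFlags.GetProperty, null, word, null);
        object document = documents.GetType().InvokeMember("Open", BindingFlags.InvokeMethod, null, documents, new object[] { docPath });
        object content = document.GetType().InvokeMember("Content", BindingFlags.GetProperty, null, document, null);
        return (string)content.GetType().InvokeMember("Text", BindingFlags.GetProperty, null, content, null);
    }
    finally
    {
        wordType.InvokeMember("Quit", BindingFlags.InvokeMethod, null, word, null);
    }
}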
Edit: If you don't know how to use BLOBs from C#, take a look here: How to: Read and Write BLOB Data to a Database Table Through an Anonymous PL/SQL Block
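A minimal sketch of steps 1 and 2 with ODP.NET (the table, column and connection string are hypothetical; the linked article covers the details):

using System.IO;
using Oracle.DataAccess.Client;

// Sketch: read the Word document BLOB and write it to a temporary .doc file.
static string SaveBlobToFile(string connectionString, int documentId)
{
    string tempPath = Path.Combine(Path.GetTempPath(), documentId + ".doc");
    using (OracleConnection conn = new OracleConnection(connectionString))
    using (OracleCommand cmd = new OracleCommand("SELECT doc_blob FROM documents WHERE doc_id = :id", conn))
    {
        cmd.Parameters.Add("id", OracleDbType.Int32).Value = documentId;
        conn.Open();
        using (OracleDataReader reader = cmd.ExecuteReader())
        {
            if (reader.Read())
            {
                byte[] contents = (byte[])reader["doc_blob"];   // the BLOB column comes back as a byte array
                File.WriteAllBytes(tempPath, contents);
            }
        }
    }
    return tempPath;
}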
This keeps coming up in my searches, so I'll add an answer for the benefit of future readers.
I highly recommend avoiding Word automation. It's painfully slow and subjects you to the whims of Microsoft's developers with each upgrade. Instead, process the files manually yourselves if you can. The files are nothing but zipped archives of XML files and resources (such as images embedded in the document).
In this case, you'd simply unzip the docx using your preferred library, manipulate the XML, and then zip the result back up.
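For instance, pulling the plain text out of a docx only takes a zip library and an XML parser (a sketch, with a made-up file name; System.IO.Compression needs .NET 4.5 or later, and older projects can use DotNetZip or SharpZipLib instead):

using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Sketch: read word/document.xml straight from the zip archive, no Word required.
static string ExtractDocxText(string docxPath)
{
    XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    using (ZipArchive archive = ZipFile.OpenRead(docxPath))
    using (Stream entry = archive.GetEntry("word/document.xml").Open())
    {
        XDocument doc = XDocument.Load(entry);
        // Each w:t element holds a run of document text; this ignores paragraph breaks.
        return string.Concat(doc.Descendants(w + "t").Select(t => (string)t));
    }
}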
This does require the use of docx files rather than doc files, but as the link above explains, this has been the default Word format since Office 2007 and shouldn't present an issue unless your users are desperately clinging to the past.
For an example of the time savings: back in 2007 we converted one process that took 45 minutes using Word automation; on the same hardware, it took 15 seconds processing the files manually. To be clear, I'm not blaming Microsoft for this - their Word automation methods don't know how you will manipulate the document, so they have to anticipate and track everything that you could possibly change. You, on the other hand, can write your method with laser focus because you know exactly what you want to do.
I'm working on a C# application that needs to store all the successive revisions of a given report file to a single project file: each time the (plain text) report file changes, the contents of the new version shall be appended to the project file, along with some metadata. Other requirements:
each version of the report file is 100 kB to 1 MB. Theoretically the maximum number of revisions is unlimited, but it should be fewer than 1000 in practice.
to keep things simple, I'd like to avoid computing differences between the revisions of the report - just store the whole report to the project file every time it has changed.
the project file should be compressed - it doesn't need to be a text file
it should be easy to retrieve a given version of the report from the application
How can I implement this in an efficient way? Should I create a custom binary file, consider using a database, other ideas?
Many thanks, Guy.
What's wrong with the simple workflow?
Un-gzip file
Append header and new report
Gzip project file
Gzip is a standard format, so it's easily accessible. Subsequent reports probably won't change that much, so you'll get a great compression ratio. To find a given report, just open the file and scan the headers. (If scanning doesn't work out, also mirror the metadata in an SQLite database, and make sure to include offsets into the project file so you can seek to the right place quickly.)
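As a sketch of that workflow (the header format and file names are mine, not anything standard):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Sketch: decompress the project file, append a metadata header plus the new report, recompress.
static void AppendRevision(string projectGz, string reportPath)
{
    byte[] existing = File.Exists(projectGz) ? Decompress(projectGz) : new byte[0];
    string report = File.ReadAllText(reportPath);
    string header = string.Format("--- revision {0:o} length {1} ---\n", DateTime.UtcNow, report.Length);

    using (FileStream file = File.Create(projectGz))
    using (GZipStream gzip = new GZipStream(file, CompressionMode.Compress))
    {
        gzip.Write(existing, 0, existing.Length);
        byte[] block = Encoding.UTF8.GetBytes(header + report + "\n");
        gzip.Write(block, 0, block.Length);
    }
}

static byte[] Decompress(string path)
{
    using (GZipStream gzip = new GZipStream(File.OpenRead(path), CompressionMode.Decompress))
    using (MemoryStream buffer = new MemoryStream())
    {
        gzip.CopyTo(buffer);
        return buffer.ToArray();
    }
}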
If your requirements are flexible (e.g. that "shall append" part) and you just want something to keep track of past versions of the file, a revision control system will do all of what you need quite easily.
No need to implement that. I would suggest using source control. Personally I use Subversion with the TortoiseSVN client. There is also a plug-in, VisualSVN, that integrates Subversion with Visual Studio. Have a look at them.
If using SVN is not an option, you can just store each revision in an individual file (with a filename that encodes the date, for example). You can use separate files for metadata as well. Then all the aforementioned files are zipped into one file (look at http://DotNetZip.codeplex.com/ for example).
I don't think there is much point building this yourself when there are already tens, if not hundreds, of systems that are basically designed to do exactly this - source control systems.
I'd recommend choosing some source control solution that has bindings to C# and store your document in there. Then you can easily check out any revision of the document. You will also be able to diff, branch, etc. if necessary.
To give just one example to get you started you can use Subversion with C# bindings.
You could use alternate data streams to store the old revisions of your file. There is no built-in support in the .NET Framework, but there are helper classes and articles, like here and here.
I have never used this myself, so I can't really tell if this is a good option. But it seems, it would make an elegant solution, since you could store each file version in a separate data stream and only the current version in the "main file". In any case, it will probably only work on NTFS drives.
I think the already-suggested SVN (or another source control system) is a very good idea, because source control has exactly the features you require. But if that's not an option, you could use a file database like SQL Server Compact Edition or SQLite.
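If you go the SQLite route, a sketch like this covers the storage side (the table name, columns and connection string are made up; you could gzip the report bytes before inserting if compression matters):

using System;
using System.Data.SQLite;
using System.IO;

// Sketch: keep every report revision, plus metadata, as a row in a SQLite project file.
static void StoreRevision(string reportPath, string comment)
{
    using (SQLiteConnection conn = new SQLiteConnection("Data Source=project.db"))
    {
        conn.Open();
        using (SQLiteCommand create = new SQLiteCommand(
            "CREATE TABLE IF NOT EXISTS revisions (" +
            "  id INTEGER PRIMARY KEY AUTOINCREMENT," +
            "  created_utc TEXT NOT NULL," +
            "  comment TEXT," +
            "  report BLOB NOT NULL)", conn))
        {
            create.ExecuteNonQuery();
        }

        using (SQLiteCommand insert = new SQLiteCommand(
            "INSERT INTO revisions (created_utc, comment, report) VALUES (@ts, @comment, @report)", conn))
        {
            insert.Parameters.AddWithValue("@ts", DateTime.UtcNow.ToString("o"));
            insert.Parameters.AddWithValue("@comment", comment);
            insert.Parameters.AddWithValue("@report", File.ReadAllBytes(reportPath));
            insert.ExecuteNonQuery();
        }
    }
}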
I need to have the ability to convert and merge various documents into a single Pdf.
The documents could be of varying types, such as Word, Open Office, Images, Text, Web pages (by URL) and the PDF would usually consist of 2-3 documents.
At the moment, we are using BCL Technologies easyPDF with Microsoft Office installed onto the Server. This handles most documents but we haven't had it doing Open Office ones yet.
We currently produce around 100-1000 of these PDFs per day.
The reason I am asking the question is that performance is a key issue. The PDF is generated for users on the fly and so the waiting times we are currently getting of 30-60 seconds is becoming unacceptable.
We have done some caching around documents when they are initially uploaded, so the main task that happens when a user requests a PDF is merging a number of already-generated PDFs.
Does anyone else have any other tools they have used that work reliably for most common document types and above all, quickly? When put like that, it seems like I'm asking a lot!
Edit:
Thanks for all the great advice, I'll look into some of these and compare performance.
Just to add to all this, money is not really an object. We're more than happy to pay for different applications to perform each task as well as looking into various hardware options to distribute the load as much as possible.
Merging multiple PDF documents is normally simple enough (as long as they don't need to be merged on the same page) - you could compare your merge performance with something like iTextSharp (.NET version of iText) to be sure it isn't a bottleneck - otherwise the conversion from other formats to PDF is likely the bottleneck.
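For the comparison, a straightforward iTextSharp 5.x merge looks something like this (the input and output paths are placeholders):

using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

// Sketch: concatenate several already-generated PDFs into one file.
static void MergePdfs(string[] inputPaths, string outputPath)
{
    Document document = new Document();
    using (FileStream output = new FileStream(outputPath, FileMode.Create))
    {
        PdfCopy copy = new PdfCopy(document, output);
        document.Open();
        foreach (string path in inputPaths)
        {
            PdfReader reader = new PdfReader(path);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                copy.AddPage(copy.GetImportedPage(reader, page));
            }
            reader.Close();
        }
        document.Close();
    }
}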
In almost all cases, the method used to convert X to PDF is to execute the application's print command, targeted at a software PDF printer, to create a temporary PDF file.
This means:
The target application (for example Office) is opened and closed
The document has to travel through the printing service
In your situation, are you converting arbitrary documents submitted by the users, or do the documents come from a stored library of files? If it's a library, you could make a PDF copy of each file as it is added to the library (instead of when the user makes a request), and then only merge the PDF files.
We use ABC Pdf. I don't know if it will be fast enough for your needs, but it seems to work for our use.
I had a very similar issue where we had documents that already existed in PDF format and needed to allow the user to see them all combined together. We purchased the PDF4NET product, which was about $500 from what I recall. It was extremely easy to use and they provide awesome examples of how to use the tools.
O2 Solutions - PDF4NET
Here is the code sample that they provide for merging. The first line just writes the merged file to disk; the last two lines merge to an in-memory document so the content can be streamed back to the user.
// Merge straight to a file on disk:
PDFFile.MergeFilesToDisk( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );

// Or merge to an in-memory document and stream it back to the user:
PDFDocument doc = PDFFile.MergeFilesToDoc( "append.pdf", "unicode.pdf", "multicolumntextandimages.pdf" );
doc.SaveToStream( stream );
You say you're using Microsoft Office to open these files; I would imagine this is the bottleneck rather than the actual PDF creation.
Is it possible to distill these documents into a more accessible format (html/xml/database), so that it's not necessary to open office every time a PDF needs to be created?
While I have no PDF conversion suggestions I can say that this problem sounds like one which could be distributed over a number of nodes. Do you find that the PDF generation is CPU-bound or are there other limiting factors? Before expending too much effort on rewriting the PDF library interface you might want to see what the bottlenecks are.