Get Word ML from clipboard - c#

I am intercepting the paste event for a RichTextBox in order to process the contents before pasting. If the contents include tables, images, etc., I need to do some custom processing. If the copied selection is from Word 2010 and consists of mixed content (e.g. text and a table or image), Word places the content on the clipboard in a number of formats. These include HTML and RTF, but I would rather work with WordML. I've used ClipSpy to check which formats and data are actually put on the clipboard, and the "Embed Source" format seems to be the one containing WordML. I would think this could be opened as a Package:
var stream = Clipboard.GetData("Embed Source") as MemoryStream;
var package = Package.Open(stream);
It throws an EndOfStreamException, and I'm thinking it might be wrapped in something else. I can write the stream to disk, open it using 7-zip, and see that the contents are as expected.
So basically two questions:
Is "Embed source" the right DataObject to get the WordML?
If it is, how do I deserialize it?

After saving the stream to disk and doing a binary comparison to a proper docx, I figured out that it was in fact wrapped in a Compound Document File: http://www.openoffice.org/sc/compdocfileformat.pdf. I googled the first few bytes
D0 CF 11 E0 A1 B1 1A E1
which is the identifier of the CDF format.
The package can be extracted from the Compound file using OpenMCDF.
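The byte comparison described above can be automated before deciding how to open a clipboard stream. The following is only a minimal sketch in Python (used here for brevity, not the C# of the question); the function name `sniff_container` is hypothetical, and the ZIP signature is included on the assumption that a bare OPC package would start with a ZIP local file header.

```python
# Leading bytes of a Compound File Binary (OLE2) container, as quoted above,
# and of a ZIP local file header (what a bare .docx package starts with).
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(data: bytes) -> str:
    """Return 'cfb', 'zip', or 'unknown' based on the first few bytes."""
    if data.startswith(CFB_MAGIC):
        return "cfb"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


# A stream beginning with the CFB signature needs to be unwrapped first.
print(sniff_container(CFB_MAGIC + bytes(504)))
print(sniff_container(b"PK\x03\x04rest-of-package"))
```

If the result is `cfb`, the package has to be pulled out of the compound file (for example with OpenMCDF, as the answer does) before it can be opened as a package.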

Related

Converting binary data to .doc file with the images inside it

I have binary data stored in a database, and I need to convert it back into files for backup purposes. Most of the entries are .doc files with images embedded in the document. My method of restoring them is to write the binary data to a byte string and then write those bytes to a file such as mydoc.doc. This works for .txt files, and it works for the text part of a .doc file as well. However, since most of the .doc files contain an embedded JPEG, after conversion I get some readable text plus random characters, which I believe come from the picture embedded in the .doc file. Any help is appreciated. Thanks in advance...
Note: The binary data is stored in a column of the image data type. The database contains the file path and name (the file no longer exists) and the corresponding binary data, so from the path I can tell what type the file originally was. Some of them are .txt (which I was able to convert perfectly), and some are .doc (which is the problem, because of the attachments inside them).
Here is my code:
string s = "D0CF11E0A1B11AE100000000000000000000"; // note: string is for example
var bytes = GetBytesFromByteString(s).ToArray();
File.WriteAllBytes("C:\\temp\\test.doc", bytes);
A .doc file is not a string, nor is it text or ASCII. It is a raw binary file format.
So if your database cell contains a BLOB (Binary Large Object) simply treat it as an array of bytes and write it out to a (binary) file. No conversions, no encodings, nothing.
Edit
Whoever designed this database designed it to store all kinds of files as an image (in the sense of a memory-dump image), i.e. a series of raw bytes in a cell of type image.
You should treat these bytes exactly as mentioned above: A series of raw bytes.
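As an illustration of "raw bytes in, raw bytes out", here is a minimal Python sketch. It assumes the data arrives as a hex string, as in the question's example; if the database driver already returns a byte array, the decoding step is not needed at all. The helper name `restore_blob` and the temporary output path are made up for the example.

```python
import os
import tempfile


def restore_blob(hex_string: str, path: str) -> int:
    """Decode a hex dump to raw bytes, write them unchanged, return the byte count."""
    raw = bytes.fromhex(hex_string)
    # Binary mode: no newline translation and no text encoding is applied.
    with open(path, "wb") as fh:
        fh.write(raw)
    return len(raw)


sample = "D0CF11E0A1B11AE100000000000000000000"  # example prefix from the question
target = os.path.join(tempfile.mkdtemp(), "test.doc")
written = restore_blob(sample, target)

# Reading the file back in binary mode yields exactly the same bytes.
with open(target, "rb") as fh:
    restored = fh.read()
```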

Storing an SVG as bytes in DB?

Wee bit of background to set the scene : we've told a client he can provide us with images of any type and we'll put them on his reports. I've just had a quick run through of doing this and found that the reports (and other things between me and them) are all designed to only use SVGs.
I thought I'd struck gold when I found online that you can convert an image from a jpg or PNG into an SVG using free tools but alas I've not yet succeeded in getting an SVG stored as bytes in the DB in a format that allows me to use it again after reading it back out.
Here's a quick timeline of what followed leading up to my problem.
I use an online tool to generate an SVG from a PNG (e.g., MobileFish)
I can open and view it in Firefox and it looks ok
I ultimately need the SVG to be stored as bytes in the DB, from where the report will pull it via a webpage that serves it up as SVG. So I write it as bytes into a SQL data script. The code I use to write these bytes is included below.
I visit said webpage and I get an error that there is an "XML parsing error on Line 1 Column 1" and it shows some of my bytes. They begin "3C73"
I revisit the DB and compare the bytes I've just written there with some pre-existing (and working ones). While my new ones begin "3C73", the others begin "0xFFFE".
I think I've probably just pointed out something really fundamental but it hasn't clicked.
Can someone tell me what I've done that means my bytes aren't stored in the correct encoding/format?
When I open my new SVG in Notepad++ I can see the content includes the following which could be relevant :
<image width="900" height="401" xLink:href="data:image/png;base64,
(base 64 encoded data follows for 600+ lines)
Here's the brains of the code that turns my SVG into the bytes to be stored in DB :
var bytes = File.ReadAllBytes(file);
using (var fs = new StreamWriter(file + ".txt"))
{
foreach (var item in bytes)
{
fs.Write(String.Format("{0:X2}",item));
}
}
Any help appreciated.
Cheers
Two things:
SVGs are vector images, not bitmap files. All that online tool is doing is taking a bitmap (a PNG or JPEG) and creating an SVG file with that bitmap embedded in it. You aren't really getting the benefit of a true SVG image. If you realise and understand that, then no worries.
SVG files are just text. In theory there is no reason you can't just store them as strings in your db. As long as the column is big enough. However normally if you are storing unstructured files in a db, the preferred column type to use is a "Blob".
http://technet.microsoft.com/en-us/library/bb895234.aspx
Converting your SVG file to hex is just making things slower and doubling the size of your files. Also when you convert back, you have to be very careful about the string encoding you are using. Which, in fact, sounds like the problem you are having.
I suspect you are doing it incorrectly. SVG is simply an XML-based vector image format. I guess your application might be using the SVG image element, in which case you need to convert your PNG image to a base64-encoded string.
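On the encoding point: bytes beginning `FF FE` are consistent with a UTF-16 little-endian byte order mark, while `3C 73` is simply `<s` in ASCII or UTF-8. The Python sketch below shows how the two prefixes can be told apart and how UTF-8 SVG text could be re-encoded to start with `FF FE`. It rests on an assumption the thread does not confirm, namely that the page serving the SVG expects UTF-16 with a BOM because the working rows start that way. The function names are hypothetical.

```python
import codecs


def describe_prefix(data: bytes) -> str:
    """Name the byte order mark at the start of data, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le with BOM"  # FF FE (UTF-32 LE also starts this way)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be with BOM"
    return "no BOM"


def to_utf16_with_bom(svg_bytes: bytes, source_encoding: str = "utf-8") -> bytes:
    """Re-encode SVG text as UTF-16 LE, prefixed with the FF FE byte order mark."""
    text = svg_bytes.decode(source_encoding)
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


new_row = bytes.fromhex("3C73")    # how the newly written bytes begin
old_row = bytes.fromhex("FFFE3C")  # how the working rows begin
print(describe_prefix(new_row), "/", describe_prefix(old_row))
print(to_utf16_with_bom(b"<svg/>").hex().upper())
```

If the SVG carries an XML declaration naming its encoding, that declaration would need to agree with whatever encoding is finally stored.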

Issues modifying file attachment streams with Outlook .OFT files

I'm attempting to programmatically replace an embedded image within an OFT file (An Outlook message template), which is in Compound File Binary Format (because using anything human readable would make my life too easy).
To work with this file, I'm using OpenMCDF.
Since embedded images are basically file attachments, I can get the stream for the image like so:
static string FOOTER_IMG = "__substg1.0_37010102"; //Stream ID for embedded JPEG footer image
static string ATTACHMENT2 = "__attach_version1.0_#00000001"; //Storage ID for attached footer image
// ...
CFStream imgStream2 = file.RootStorage.GetStorage(ATTACHMENT2).GetStream(FOOTER_IMG);
I can then update that stream with the bytes from my desired image like so:
byte[] img2 = File.ReadAllBytes(footerimgFile); // New file
imgStream2.SetData(img2);
However, when I load the .OFT file in Outlook, the image no longer loads and I get a red X saying the image could not be loaded. I spent hours analyzing every bit of that OFT file, and the only thing that changed between the original template and the new template is that one stream that I replaced.
Here's where things get weird:
I noticed I could replace the bytes with the same exact bytes I had before and save it, so my saving mechanism is working. I thought maybe the OFT template stores some sort of hash of the image which has to match up. So I modified a few random bytes, and the image still loads (sometimes with some funky colors). Eventually, I realized it only breaks if the new image contains fewer bytes than the original image. I can replace the image with a larger image, and that works! I can also just pad a smaller image with trailing zeros at the end of the stream, and it still works.
This led me to come up with a true hackerific masterpiece:
if (img2.Length < 5585) img2 = img2.Concat(new byte[5585 - img2.Length]).ToArray();
Basically, if img2 is too small, I pad on enough bytes to make it the same size as the original image (5585 bytes to be exact). So this works. But.. yea.
My Question:
Does the Microsoft OFT file format store the byte count for attachments in some other stream or some other CDF container? If this was a standard property of CDF, you'd think OpenMCDF would update this count. This leads me to believe this is a property of the OFT file format, which OpenMCDF would of course know nothing about.
Why would writing a smaller stream corrupt the file, while writing a larger stream works?
Update:
From what I've read so far, the __properties_version1.0 stream contains a list of pointers (offsets?) to mark where various other streams are. I'm guessing something in here needs to be updated. Currently, I have these streams in the attachment container:
From what I can tell, __properties_version1.0 hardly changes at all between the first attachment (a 36,463 byte file) and the second attachment (a 5,585 byte file). The __properties_version1.0 for the second attachment is:
There's only a set of 8 bytes that change between those two attachments. In attachment 1 we have:
6F 8E 00 00 03 00 2D 00
In attachment 2 (pictured above) we have:
D1 15 00 00 03 00 6F 08
Are those offsets? Doesn't seem to be a range, or the numbers would go up. Those numbers are also way too big to be file sizes. Plus, it seems redundant to store file sizes in here anyway. So, I'm once again at a loss as to why changing the size of the 0x37010102 stream causes the image to no longer load.
Another thing that makes zero sense. I can change the size of the first attachment with either larger or smaller files, and nothing breaks. However, there's absolutely no difference between any stream in those two containers except the data in the 0x37010102 stream. Why does this approach work in one attachment and not the other?
Update 2:
I have noticed the two differences in the __properties_version1.0 stream between the two attachments do correspond to the file sizes:
6F 8E 00 00 03 00 2D 00 // Attachment 1
D1 15 00 00 03 00 6F 08 // Attachment 2
6F 8E seems to be a little-endian representation of the file size: 0x8E6F is 36463 in decimal, which is the number of bytes in the first attachment. 0x15D1 is 5585 in decimal, the size of the second attachment. So this stream definitely stores file sizes. The next step is to see whether fixing those bytes makes the corrupted file load again.
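The arithmetic above can be checked mechanically. This is a small Python sketch that reads the first four bytes of each quoted sequence as an unsigned 32-bit little-endian integer; the helper name `read_u32_le` is made up for the example, and treating the field as four bytes wide is an assumption based on the two zero bytes that follow the size in both dumps.

```python
import struct


def read_u32_le(hex_bytes: str) -> int:
    """Interpret the first four bytes of a hex dump as a little-endian uint32."""
    return struct.unpack("<I", bytes.fromhex(hex_bytes)[:4])[0]


print(read_u32_le("6F 8E 00 00 03 00 2D 00"))  # 36463, size of attachment 1
print(read_u32_le("D1 15 00 00 03 00 6F 08"))  # 5585, size of attachment 2
```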
Update 3:
So, changing those bytes fixes a previously corrupted file, so that's the key! Now just to find a way to do this programmatically.
Are you working with embedded HTML images (which are just regular image attachments) or embedded RTF images (which are OLE storage)?
Do you simply truncate a particular stream without adjusting any other properties?
Well, it's times like this I feel like an uber nerd. Here's the code that fixes the problem. Note, the byte offsets in propBytes might be different if you had other properties in the attachment.
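// Fix file size on prop stream
var propStream = file.RootStorage.GetStorage(ATTACHMENT2).GetStream("__properties_version1.0");
var propBytes = propStream.GetData();
// The size is stored little-endian (low byte first)
propBytes[0xb0] = (byte)(img2.Length & 0xFF);
propBytes[0xb1] = (byte)((img2.Length >> 8) & 0xFF);
// The upper two bytes only matter for images larger than 65,535 bytes
propBytes[0xb2] = (byte)((img2.Length >> 16) & 0xFF);
propBytes[0xb3] = (byte)((img2.Length >> 24) & 0xFF);
propStream.SetData(propBytes);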
However, I like this solution better than padding extra zeros.
I think the real solution would be to use a third party library that deals with .MSG format, however I could not find any that don't make you install Outlook or Exchange on the server (which we can't do) or that are free (we have zero budget for this).
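For readers who want to experiment with the size patch outside of C#, here is a minimal Python sketch of the same idea: overwrite a little-endian size field at a known offset in a copy of the property stream. The function name `patch_size` is hypothetical, the 0xB0 offset is the one reported in this answer and may differ for other templates, and the four-byte field width is an assumption. Reading and writing the stream inside the compound file is left to whatever CFB library is in use.

```python
import struct


def patch_size(prop_bytes: bytes, offset: int, new_size: int) -> bytes:
    """Return a copy of prop_bytes with a little-endian uint32 written at offset."""
    patched = bytearray(prop_bytes)
    struct.pack_into("<I", patched, offset, new_size)
    return bytes(patched)


# Stand-in for the real __properties_version1.0 contents (all zeros here).
fake_props = bytes(0xC0)
fixed = patch_size(fake_props, 0xB0, 5585)
print(fixed[0xB0:0xB4].hex(" ").upper())
```

Only the four bytes at the offset change; the length of the stream stays the same.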

Matplotlib savefig to BytesIO is slightly wrong?

I'm trying to save matplotlib figures to a memory stream, exactly as in another example on SO:
import matplotlib.pyplot as plt
import io
plt.figure()
plt.plot([1, 2])
plt.title("test")
buf = io.BytesIO()
plt.savefig(buf, format = 'png')
plt.savefig("real.png", format = 'png')
buf.seek(0)
data = buf.read()
buf.close()
f = open('copy.png', 'w')
f.write(data)
f.close()
I find that copy.png is slightly larger in size and applications refuse to open it. Is this some sort of encoding issue?
Background:
I'm trying to use python.net to render graphs with matplotlib and pass them out to C# for drawing. I want to avoid writing the images to disk. Ideally, I want to write to some sort of byte array that I can work with in C#.
Try opening the file in binary mode.
f = open('copy.png', 'wb')
From the documentation:
Python on Windows makes a distinction between text and binary files;
the end-of-line characters in text files are automatically altered
slightly when data is read or written. This behind-the-scenes
modification to file data is fine for ASCII text files, but it’ll
corrupt binary data like that in JPEG or EXE files. Be very careful to
use binary mode when reading and writing such files.
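A small round trip makes the difference concrete. The sketch below avoids a matplotlib dependency by using a stand-in payload that starts with the 8-byte PNG signature, which deliberately contains both a CR LF pair and a lone LF so that newline translation is easy to detect. In the question's code, `plt.savefig(buf, format='png')` would fill the buffer instead. Note also that in Python 3, writing bytes to a file opened with `'w'` raises a `TypeError` rather than silently altering the data.

```python
import io
import os
import tempfile

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

buf = io.BytesIO()
buf.write(PNG_SIGNATURE + b"stand-in for the image data")
data = buf.getvalue()  # the whole buffer, no seek(0)/read() needed
buf.close()

path = os.path.join(tempfile.mkdtemp(), "copy.png")
with open(path, "wb") as fh:  # binary mode: bytes are written untouched
    fh.write(data)

with open(path, "rb") as fh:
    round_tripped = fh.read()

print(round_tripped == data, os.path.getsize(path) == len(data))
```

For the stated goal of handing the image to C# without touching disk, the `data` bytes object is already the byte array in question; the file write here is only for demonstrating the mode.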

Why binary data writing's result is string?

I want to create a binary file and store string data in it, I used this sample:
FileStream fs = new FileStream("c:\\test.data", FileMode.Create);
BinaryWriter bw = new BinaryWriter(fs);
bw.Write(Encoding.ASCII.GetBytes("david stein"));
bw.Close();
but when I opened the file created by this sample (test.data) in Notepad, it showed the string data ("david stein"). So my question is: what's the difference between this binary writing and text writing when the result is a string?
I'm looking to write my data to a binary file so that a user cannot open and read it in Notepad; if the user does open it in Notepad, they should see real binary data.
When you open some files in a text editor, such as JPG files, you cannot read their contents, even though they do not use any encryption. Why is that? How can I write my data like this?
now my question is that what's the difference between this binary writing and text writing when the result is string?
The data in a file is always "a sequence of bytes". In this case, the sequence of bytes you've written is "the bytes representing the text 'david stein'" in the ASCII encoding. So yes, if you open the file in any editor which tries to interpret the bytes as text in a way which is compatible with ASCII, you'll see the text "david stein". Really it's just a load of bytes though - it all depends on how you interpret them.
If you'd written:
File.WriteAllText("c:\\test.data", "david stein", Encoding.ASCII);
you'd have ended up with the exact same sequence of bytes. There are any number of ways you could have created a file with the same bytes in. There's nothing about File.WriteAllText which "marks" the file as text, and there's nothing about FileStream or BinaryWriter which marks the file as binary.
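The same point can be demonstrated in a few lines of Python (a sketch using temporary files; the file names are made up): writing encoded bytes through a binary handle and writing the string through a text handle with the ASCII encoding leave identical bytes on disk.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
binary_path = os.path.join(folder, "binary.data")
text_path = os.path.join(folder, "text.data")

# "Binary" write: encode the string ourselves, then write the bytes.
with open(binary_path, "wb") as fh:
    fh.write("david stein".encode("ascii"))

# "Text" write: let the file object do the encoding.
with open(text_path, "w", encoding="ascii", newline="") as fh:
    fh.write("david stein")

with open(binary_path, "rb") as fh:
    binary_bytes = fh.read()
with open(text_path, "rb") as fh:
    text_bytes = fh.read()

print(binary_bytes == text_bytes)
```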
EDIT: From comments:
I'm looking to create a data in binary file until user can not open and read my data by note pad
Well, there are lots of ways of doing that with different levels of security. Ideally, you'd want some sort of encryption - but then the code reading the data would need to be able to decrypt it as well, which means it would need to be able to get a key. That then moves the question to "how do I store a key".
If you only need to verify the data in the file (e.g. check that it matches something from elsewhere) then you could use a cryptographic hash instead.
If you only need to prevent the most casual of snoopers, you could use something which is basically just obfuscation - a very weak form of encryption with no "key" as such. Anyone who decompiled your code would easily be able to get at the data in that case, but you may not care too much about that.
It all depends on your requirements.
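For the "verify only" case mentioned above, a cryptographic hash can be stored in place of the data itself. The following is a minimal Python sketch using SHA-256; the choice of algorithm and the function names are assumptions made for illustration, and a real design would also need to consider salting if the hashed values are guessable.

```python
import hashlib
import hmac


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches(candidate: bytes, expected_hex: str) -> bool:
    """Check a candidate value against a stored digest in constant time."""
    return hmac.compare_digest(digest(candidate), expected_hex)


stored = digest(b"david stein")  # this is what would go in the file
print(matches(b"david stein", stored), matches(b"someone else", stored))
```

The stored digest does not reveal the original text to someone opening the file in an editor, but it also cannot be turned back into that text by the program.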
All data is binary. A text file is binary data that happens to be a limited subset that represent valid characters, but it's still binary.
A common way for text editors to tell a text file from a binary file is to scan a portion of the file for zero bytes, \0. These essentially never appear in ASCII or UTF-8 text files (UTF-16 text is the notable exception) and almost always appear in binary files.
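The heuristic is short enough to sketch in Python. The function name `looks_binary` and the 8192-byte sample size are arbitrary choices for the example, not taken from any particular editor. The last call shows a known weak spot: UTF-16 text contains zero bytes and is therefore misclassified.

```python
def looks_binary(data: bytes, sample_size: int = 8192) -> bool:
    """Guess 'binary' if a zero byte appears in the first sample_size bytes."""
    return b"\x00" in data[:sample_size]


print(looks_binary(b"david stein"))                            # plain ASCII text
print(looks_binary(bytes.fromhex("D0CF11E0A1B11AE100000000")))  # compound file header
print(looks_binary("david stein".encode("utf-16-le")))          # UTF-16 text, false positive
```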
