Parsing and extracting HTML and Text messages from raw .eml file

I did this before, but the source code was on a flash drive that failed, so I'm rewriting the project. I've got archived .eml files on disk and I need to extract the HTML message if there is one. If not, I'll extract the text message.
I don't remember which third-party library I used before, but parsing was as easy as this:
oMessage.Raw = LoadFile(fileEml);
msgHTML = oMessage.HTML;
msgText = oMessage.Text;
Can anyone recognize that 3rd party library?

I don't know what parsing library you were using before, but the best parser out there right now is MimeKit. It's significantly faster and more RFC-compliant than any other parser.
I'm also working on a mail client library (SMTP, POP3 and IMAP) called MailKit.
(Disclosure: I wrote MimeKit and MailKit after spending time looking at all of the Open Source alternatives out there and finding that all of them were really bad, and often the authors of said libraries blamed the other mail software when it was a bug in their own code and they would have realized that if they actually bothered to read the RFCs.)
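For the "HTML if any, otherwise text" requirement in the question, a minimal sketch with MimeKit might look like the following. It assumes the MimeKit NuGet package is referenced; the class and method names (EmlReader, ExtractBody) are made up for illustration and are not part of the library.

```csharp
using System;
using MimeKit;

public static class EmlReader
{
    // Hypothetical helper: returns the HTML body when the message has one,
    // otherwise the plain-text body. The result is null if the message has neither.
    public static string ExtractBody(string emlPath)
    {
        // MimeMessage.Load parses the raw .eml file from disk.
        MimeMessage message = MimeMessage.Load(emlPath);

        // HtmlBody and TextBody are null when no matching part exists.
        return message.HtmlBody ?? message.TextBody;
    }
}
```

Calling EmlReader.ExtractBody(fileEml) would then stand in for the oMessage.HTML / oMessage.Text pair from the question.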

Related

.msg to .eml file using Aspose.Email missing Calendar data

I am trying to convert from .msg to .eml file format using Aspose.Email for .NET. Pretty trivial:
var msg = MapiMessage.FromFile(@"example.msg");
MailMessageInterpretor mmi = MailMessageInterpretorFactory.Instance.GetIntepretor(msg.MessageClass);
MailMessage eml = mmi.Interpret(msg);
eml.Save(@"example.eml");
If it's a calendar invite, I see it in the .msg file and also in the .eml file, as shown below (.msg on top, .eml on the bottom):
It also includes the meeting .ics file within the .eml it produces:
Content-Type: text/calendar; method="REQUEST"; name="meeting.ics"; charset="utf-8"
Content-Transfer-Encoding: base64
However, the problem is that if it is a meeting cancellation instead of an invitation, the resulting .eml does not include the meeting details, nor does it contain any trace of a meeting .ics (which does exist in the .msg). I can parse the .msg for it easily:
MapiCalendar calendar = (MapiCalendar)msg.ToMapiMessageItem();
Why is it not base-64 encoding the .ics for cancellations in the resulting .eml? Am I doing something wrong? Is it an Aspose bug? Is it normal behavior? What's going on here?

Could you please share your input/output files with us on the Aspose.Email forum? We need to understand which steps give rise to this issue, and we will assist you further in the forum. If we find it to be a bug in the API, we'll log it for investigation by our Product team.
I work with Aspose as Developer Evangelist.
Edit:
Please try mmi.InterpretAsTnef(msg). It should work with Interpret as well, so this looks like an issue with the API. We're investigating the problem at our end; in the meantime you can use InterpretAsTnef. You may register on www.Aspose.com free of cost. Our basic support is free for all users, paid or not.

FO.NET XSL-FO Capabilities

I'm going to try to ask a very broad question here, not related to any specific code, but rather to an expected outcome. This is to see if anyone can answer with some certainty that FO.NET will be able to produce this outcome for me.
My goal: to port a service generating a PDF based on an XML document from Java (using Apache FOP) to C#. This is to make things easier in a setup that uses only IIS as a host.
Where I'm at: I have a working WCF service that gets the XML document, transforms it to XSL-FO, and returns a PDF to the browser. What's left to do is to fix the styling so that it matches the previous PDF generated by Java Apache FOP. I'm currently using FO.NET and I'm hoping I don't have to redo everything, but if that's the case then so be it.
The XSL Stylesheet is using SVG for creating figures, importing images etc and I know that this is not supported in FO.NET but maybe there is a workaround. The images can be converted to another file format but the shapes might be trickier.
Expected outcome (as it is with the current service): http://imgur.com/INkzvdo
Current outcome: http://imgur.com/Y5dbb3X
Question: Can this be done using FO.NET? If not, is there any other open-source library that is better suited, or do I have to solve this in another way?
The reason I'm trying to use XSL-FO is that we already have three stylesheets defining the output PDF (There will be three different PDF Outputs) and it would be nice not having to redo everything using for instance CSS.

I'm going to answer this myself, partly based on the response I received, partly on my own experience with using FO.NET. I don't recommend using C# for PDFs if XSL-FO is used for styling. If C# is a requirement I would investigate CSS for this purpose instead. If C# is not a requirement, I would recommend using Java and Apache FOP and host it here. If you, as in my case, need to host multiple services on the same outgoing port, you can use Application Request Routing in IIS (an add-on, I think) to serve as a reverse proxy for your Java Servlet. Thanks for your feedback @mzjn!

Reading Excel files without a library

I am trying to figure out how Excel files are written, so I can read and edit them in C# without a library (because I like to make work for myself like that). When opening them in Notepad, all you can see is strange wingding characters, so I immediately thought I might get some results by reading the file into a byte array. No luck: I do get a sensibly sized byte array, but converting it to a string gives a useless result!
So I guess my question is, how are Excel files written, and how can I read them without a library?

A good place to start would be this project: Excel Data Reader (CodePlex)
It has working code, and you'll be able to learn the older "XLS" format from it. "XLSX" is the newer Open XML document standard. Microsoft has an SDK for that:
Open XML for Office Developers (Microsoft)
I wanted to learn the format myself for some Mono work I've been doing. The older "XLS" file format isn't exactly something that you'd "learn"; it's pretty complicated if you ask me. I ended up just waiting for a Mono product to appear rather than buying source from one of a number of vendors and porting it to Mono.
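To make the difference between the two formats concrete, here is a rough sketch that uses only the .NET base class library. It checks the first bytes of a file: .xlsx files are ZIP packages (signature 50 4B 03 04), while legacy .xls files are OLE compound files (signature starting D0 CF 11 E0). That is also why dumping the bytes into a string looks like garbage. The class and method names are invented for this example, and it assumes a seekable stream.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class ExcelSniffer
{
    // Guesses the container format from the first four bytes of the stream.
    public static string DetectFormat(Stream file)
    {
        var header = new byte[4];
        int read = file.Read(header, 0, header.Length);
        file.Position = 0; // rewind so the caller can reuse the stream

        if (read == 4 && header[0] == 0x50 && header[1] == 0x4B && header[2] == 0x03 && header[3] == 0x04)
            return "xlsx (ZIP package)";
        if (read == 4 && header[0] == 0xD0 && header[1] == 0xCF && header[2] == 0x11 && header[3] == 0xE0)
            return "xls (OLE compound file)";
        return "unknown";
    }

    // Lists the parts inside an .xlsx package, e.g. xl/workbook.xml.
    public static List<string> ListParts(Stream xlsx)
    {
        var names = new List<string>();
        using (var zip = new ZipArchive(xlsx, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
                names.Add(entry.FullName);
        }
        return names;
    }
}
```

For an .xlsx file, the parts listed by ListParts are plain XML and can be read with the usual XML classes; an .xls file needs a compound-file parser, which is the complicated part mentioned above.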

Batch conversion of docx to clean HTML

I'm starting to wonder if this is even possible. I've searched for solutions on Google and come up with nothing that works exactly how I'd like it to.
It might help to explain what that entails. I work for the database group at my university's IT department. My main job is to take the specs of a report in a docx file, copy that over to Dreamweaver, fix some formatting, and put it on their website. My issue is that it's ridiculously tedious to do this over and over. I figured, hey, I haven't written anything in C# for some time now, perhaps I could write an application to grab a docx file, convert it to HTML, fix the CSS, stick the header and footer from the webpage on there, and save the result. I originally planned to have it process files one by one, but it probably wouldn't be difficult to have it take a list of files and batch convert.
I've found these relevant topics on how to accomplish this, but they don't fit my needs well enough.
http://www.techrepublic.com/blog/howdoi/how-do-i-modify-word-documents-using-c/190
This is probably fine for a few documents, but since it's just automating an instance of Word, I feel like it'd be slow and memory intensive. I'd prefer to avoid opening and closing an instance of Word 50+ times.
http://openxmldeveloper.org/articles/333.aspx
This is what I started using. XSLT has the benefit of not needing Word to be installed or run for each file. After some searching I got a proof of concept working. It takes in a docx file, decompresses it, grabs document.xml from it, and applies the DocX2Html.xsl file I scavenged from OpenXML Viewer. I believe that was originally provided by MS for SharePoint servers so they could render Word documents in a browser, or something along those lines.
After adjusting that code to fit my needs, and having issues with the objXSLT.Load() method, I ended up using ILMerge to make the XSL into a DLL. No idea why I kept getting a compile error when using the plain old XSL file, but the DLL worked fine, so I was satisfied. Here (http://pastebin.com/a5HBAakJ) is my current code. It does the job of converting docx to HTML just fine (other than random spaces between some words), but the result file has ridiculously ugly HTML syntax. An example of this monstrosity can be found here (http://pastebin.com/b8sPGmFE).
Does anyone know how I could remedy this? I'm thinking perhaps I need to make a new XSL file, as the one MS provided is what's responsible for sticking all those tags and extra code in there. My issue with that is that I don't know anything about how to do that. Perhaps there's an alternative version already out there. All I'd need is one that will preserve tables and text formatting. Images aren't needed.
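For readers following along, the pipeline described in the question (open the docx as a ZIP, pull out word/document.xml, run it through an XSLT stylesheet) can be sketched roughly as below using only the .NET base class library. This is not the code from the pastebin link; the class name is invented, and whether a particular stylesheet such as DocX2Html.xsl loads without extra XsltSettings is an open assumption.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml;
using System.Xml.Xsl;

public static class DocxXslt
{
    // Applies the given stylesheet to word/document.xml inside a .docx package
    // and returns the transformed output as a string.
    public static string Transform(Stream docx, XmlReader stylesheet)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; is this a .docx file?");

            var xslt = new XslCompiledTransform();
            xslt.Load(stylesheet);

            using (Stream part = entry.Open())
            using (XmlReader input = XmlReader.Create(part))
            using (var output = new StringWriter())
            {
                // No XSLT arguments are passed in this sketch.
                xslt.Transform(input, null, output);
                return output.ToString();
            }
        }
    }
}
```

Batch conversion is then a loop over the input files that calls Transform once per file; a simpler, purpose-built stylesheet passed in here would be one way to get cleaner HTML out.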

This looks like just what you need: http://msdn.microsoft.com/en-us/library/ff628051(v=office.14).aspx
The author Eric White blogged about his experiences developing that tool. You can see that list of posts on his blog here: http://blogs.msdn.com/b/ericwhite/archive/2008/10/20/eric-white-s-blog-s-table-of-contents.aspx#Open_XML_to_XHtml

Since I'm a big fan of Aspose.Words, a commercial library to create/process Word documents, I would do something like:
Open the Word document with Aspose.Words.
Save the Word document as HTML.
Use something like SgmlReader or HTML Agility Pack (or even Regular Expressions if it is suitable) to remove unwanted HTML tags/attributes.
Since you wrote that you work at a university, I'm not sure whether commercial packages are an option, though.
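As a rough illustration of the last step (the regular-expression variant), the sketch below strips style, class and lang attributes and unwraps span and font tags. The class name and the list of attributes and tags are assumptions chosen for this example. Regular expressions are only suitable for fairly regular converter output; a real HTML parser such as HTML Agility Pack is the safer choice for anything messier.

```csharp
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Removes presentational attributes and wrapper tags from converter output.
    public static string StripPresentation(string html)
    {
        // Drop style, class and lang attributes (double- or single-quoted values).
        html = Regex.Replace(
            html,
            @"\s+(style|class|lang)\s*=\s*(""[^""]*""|'[^']*')",
            "",
            RegexOptions.IgnoreCase);

        // Unwrap <span> and <font> tags but keep their inner content.
        html = Regex.Replace(html, @"</?(span|font)\b[^>]*>", "", RegexOptions.IgnoreCase);

        return html;
    }
}
```

Tables and basic text formatting tags (p, b, i, table, tr, td) pass through untouched, which matches the requirement in the question.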

Hi, I'm not sure what the rules are on promoting your own solutions, so do let me know if I am out of line.
I am a web developer who had the same issues, so I created my own tool:
http://www.convertwordtohtml.com
We are also working on a new version that will have even better conversion quality and one-click conversion, e.g. you can right-click on a Word file and it will be converted directly to HTML with the code placed on the clipboard. The current version also supports command-line access, and the new version will have a server version too.
There is a free trial version downloadable from the site, and if you have any questions do contact me any time.

How can I send emails that preserve the formatting that a user provides in RTF in a RichTextBox?

What I'm trying to do is provide a form where a user can type or cut and paste formatted text and send it as an email (similar to Outlook). This is required because it closely resembles the current workflow, and these emails aren't being saved anywhere besides people's inboxes. This is obviously a bandage on a bigger problem.
My current attempt has a RichTextBox that can receive RTF that is copy and pasted but when I try to send the email, it seems that the only options are plain text and HTML. After investigating options for an RTF to HTML library, it seems that they all cost at least $300 but after reviewing how difficult it would be to write the library myself, the money and time is better spent getting a third party option. I'm curious if there is a solution to this problem (sending an email with formatted text) without bringing in a third party library.

Most email clients can't display email in RTF, and that's just how it is. You can't change the email clients.
So, you need to send the email in HTML. There's no built-in WinForms control to export formatted text in HTML, unfortunately, so there's no way to accomplish this without third-party code.
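Once the HTML exists, the sending side needs nothing beyond the framework. A minimal sketch with System.Net.Mail follows; the class name and the addresses in the usage note are placeholders, and the RTF-to-HTML conversion itself is assumed to have happened elsewhere.

```csharp
using System.Net.Mail;

public static class HtmlMail
{
    // Builds a message whose body is sent as text/html.
    public static MailMessage Build(string from, string to, string subject, string html)
    {
        var message = new MailMessage(from, to);
        message.Subject = subject;
        message.Body = html;
        message.IsBodyHtml = true; // without this flag the markup is sent as plain text
        return message;
    }
}
```

The message returned by, say, HtmlMail.Build("sender@example.com", "recipient@example.com", "Report", html) can then be handed to SmtpClient.Send against whatever SMTP server is available.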

You need an RTF to HTML converter. You're right, it may not be worth your time to write one. I did anyway. It wasn't too bad because I had some control over the RTF document creation and could prohibit things that I didn't want to translate to HTML. Converting RTF to HTML is basically just a document parser with the ability to replace RTF command verbs with their HTML equivalents.
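To give a feel for the "replace command verbs" idea, here is a deliberately tiny sketch, not the converter described above. It assumes the input is a body fragment that uses only \b, \b0, \i, \i0 and \par; it drops every other control word, skips group braces, and does not handle font tables, \'hh hex escapes or Unicode escapes. All names are invented for the example.

```csharp
using System.Net;
using System.Text;

public static class TinyRtfToHtml
{
    public static string Convert(string rtf)
    {
        var html = new StringBuilder();
        int i = 0;
        while (i < rtf.Length)
        {
            char c = rtf[i];
            if (c == '\\')
            {
                i++;
                int start = i;
                // A control word is a run of lowercase ASCII letters...
                while (i < rtf.Length && rtf[i] >= 'a' && rtf[i] <= 'z') i++;
                if (i == start)
                {
                    // ...otherwise this is an escaped literal such as \\, \{ or \}.
                    if (i < rtf.Length) html.Append(WebUtility.HtmlEncode(rtf[i].ToString()));
                    i++;
                    continue;
                }
                // Optional numeric parameter, e.g. the 0 in \b0.
                while (i < rtf.Length && char.IsDigit(rtf[i])) i++;
                string word = rtf.Substring(start, i - start);
                // A single space after a control word is a delimiter, not text.
                if (i < rtf.Length && rtf[i] == ' ') i++;
                html.Append(Translate(word));
            }
            else if (c == '{' || c == '}' || c == '\r' || c == '\n')
            {
                i++; // group braces and raw line breaks carry no text in this subset
            }
            else
            {
                html.Append(WebUtility.HtmlEncode(c.ToString()));
                i++;
            }
        }
        return html.ToString();
    }

    // Maps the supported control words to HTML; everything else is dropped.
    private static string Translate(string word)
    {
        switch (word)
        {
            case "b": return "<b>";
            case "b0": return "</b>";
            case "i": return "<i>";
            case "i0": return "</i>";
            case "par": return "<br>";
            default: return "";
        }
    }
}
```

Restricting what the RTF may contain, as mentioned above, is what keeps a converter like this manageable; real RichTextBox output includes header groups that this sketch would not translate correctly.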

I ended up finding a free solution:
http://www.dreamincode.net/forums/showtopic48398.htm
It's not a perfect translation but it's better than any of the pay packages out there.
