Reading an mbox file in C#

One of our staff members has lost his mailbox but luckily has a dump of his email in mbox format. I need to get all the messages out of the mbox file and squirt them into our tech support database (as it's a custom tool, there are no import tools available).
I've found SharpMimeTools, which breaks down a single message but does not allow you to iterate through a bunch of messages in an mbox file.
Does anyone know of a decent open parser, so that I don't have to learn the RFC and write one myself?

I'm working on a MIME & mbox parser in C# called MimeKit.
It's based on earlier MIME & mbox parsers I've written (such as GMime) which were insanely fast (could parse every message in a 1.2GB mbox file in about 1 second).
I haven't tested MimeKit for performance yet, but I am using many of the same techniques in C# that I used in C. I suspect it'll be slower than my C implementation, but since the bottleneck is I/O and MimeKit is written to do optimal (4k) reads like GMime is, they should be pretty close.
Your current approach (StreamReader.ReadLine(), combining the text, then passing it off to SharpMimeTools) is slow for the following reasons:
StreamReader.ReadLine() is not a very optimal way of reading data from a file. While I'm sure StreamReader() does internal buffering, it needs to do the following steps:
A) Convert the block of bytes read from the file into unicode (this requires iterating over the bytes in the byte[] read from disk to convert the bytes read from the stream into a unicode char[]).
B) Then it needs to iterate over its internal char[], copying each char into a StringBuilder until it finds a '\n'.
So right there, with just reading lines, you have at least 2 passes over your mbox input stream. Not to mention all of the memory allocations going on...
Then you combine all of the lines you've read into a single mega-string. This requires another pass over your input (copying every char from each string read from ReadLine() into a StringBuilder, presumably?).
We are now up to 3 iterations over the input text and no parsing has even happened yet.
Now you hand off your mega-string to SharpMimeTools which uses a SharpMimeMessageStream which... (/facepalm) is a ReadLine()-based parser that sits on top of another StreamReader that does charset conversion. That makes 5 iterations before anything at all is even parsed. SharpMimeMessageStream also has a way to "undo" a ReadLine() if it discovers it has read too far. So it is reasonable to assume that it is scanning over some of those lines at least twice. Not to mention all of the string allocations going on... ugh.
For each header, once SharpMimeTools has its line buffer, it splits into field & value. That's another pass. We are up to 6 passes so far.
SharpMimeTools then uses string.Split() (which is a pretty good indication that this mime parser is not standards compliant) to tokenize address headers by splitting on ',' and parameterized headers (such as Content-Type and Content-Disposition) by splitting on ';'. That's another pass. (We are now up to 7 passes.)
Once it splits those it runs a regex match on each string returned from the string.Split() and then more regex passes per rfc2047 encoded-word token before finally making another pass over the encoded-word charset and payload components. We're talking at least 9 or 10 passes over much of the input by this point.
I give up going any farther with my examination because it's already more than 2x as many passes as GMime and MimeKit need and I know my parsers could be optimized to make at least 1 less pass than they do.
Also, as a side-note, any MIME parser that parses strings instead of byte[] (or sbyte[]) is never going to be very good. The problem with email is that so many mail clients/scripts/etc in the wild will send undeclared 8bit text in headers and message bodies. How can a unicode string parser possibly handle that? Hint: it can't.
using (var stream = File.OpenRead ("Inbox.mbox")) {
    var parser = new MimeParser (stream, MimeFormat.Mbox);
    while (!parser.IsEndOfStream) {
        var message = parser.ParseMessage ();
        // At this point, you can do whatever you want with the message.
        // As an example, you could save it to a separate file based on
        // the message subject:
        message.WriteTo (message.Subject + ".eml");
        // You also have the ability to get access to the mbox marker:
        var marker = parser.MboxMarker;
        // You can also get the exact byte offset in the stream where the
        // mbox marker was found:
        var offset = parser.MboxMarkerOffset;
    }
}
2013-09-18 Update: I've gotten MimeKit to the point where it is now usable for parsing mbox files and have successfully managed to work out the kinks, but it's not nearly as fast as my C library. This was tested on an iMac so I/O performance is not as good as it would be on my old Linux machine (which is where GMime is able to parse similar sized mbox files in ~1s):
[fejj@localhost MimeKit]$ mono ./mbox-parser.exe larger.mbox
Parsed 14896 messages in 6.16 seconds.
[fejj@localhost MimeKit]$ ./gmime-mbox-parser larger.mbox
Parsed 14896 messages in 3.78 seconds.
[fejj@localhost MimeKit]$ ls -l larger.mbox
-rw-r--r-- 1 fejj staff 1032555628 Sep 18 12:43 larger.mbox
As you can see, GMime is still quite a bit faster, but I have some ideas on how to improve the performance of MimeKit's parser. It turns out that C#'s fixed statements are quite expensive, so I need to rework my usage of them. For example, a simple optimization I did yesterday shaved about 2-3s from the overall time (if I remember correctly).
Optimization Update: Just improved performance by another 20% by replacing:
while (*inptr != (byte) '\n')
    inptr++;
with:
do {
    mask = *dword++ ^ 0x0A0A0A0A;
    mask = ((mask - 0x01010101) & (~mask & 0x80808080));
} while (mask == 0);
inptr = (byte*) (dword - 1);
while (*inptr != (byte) '\n')
    inptr++;
Optimization Update: I was able to finally make MimeKit as fast as GMime by switching away from my use of Enum.HasFlag() and using direct bit masking instead.
MimeKit can now parse the same mbox stream in 3.78s.
For comparison, SharpMimeTools takes more than 20 minutes (to test this, I had to split the emails apart into separate files because SharpMimeTools can't parse mbox files).
Another Update: I've gotten it down to 3.00s flat via various other tweaks throughout the code.

I don't know of any parser, but mbox is really a very simple format. A new email begins at each line starting with "From " (From+Space), and an empty line is appended to the end of each mail. Should there be any occurrence of "From " at the beginning of a line in the email itself, it is quoted (by prepending a '>').
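The description above can be turned into a small splitter. This is only a sketch under a few assumptions: MboxSplitter and ReadMessages are made-up names, the file uses the simple quoting described above (a single '>' before "From "), and the text is decoded as ISO-8859-1 so that every byte maps to exactly one char. It normalizes line endings to '\n' and is not tuned for speed, so treat it as an illustration of the format rather than a replacement for a real parser.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class MboxSplitter
{
    // Yields each message as text, without its "From " separator line.
    public static IEnumerable<string> ReadMessages(Stream stream)
    {
        // ISO-8859-1 maps bytes 0-255 to chars 0-255, so no byte is lost.
        var latin1 = Encoding.GetEncoding("ISO-8859-1");
        using (var reader = new StreamReader(stream, latin1, false))
        {
            StringBuilder current = null;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.StartsWith("From ", StringComparison.Ordinal))
                {
                    // A separator line closes the previous message, if any.
                    if (current != null)
                        yield return current.ToString();
                    current = new StringBuilder();
                    continue;
                }
                if (current == null)
                    continue; // ignore anything before the first separator

                // Undo the quoting of body lines that began with "From ".
                if (line.StartsWith(">From ", StringComparison.Ordinal))
                    line = line.Substring(1);
                current.Append(line).Append('\n');
            }
            if (current != null)
                yield return current.ToString();
        }
    }
}
```

Each returned string can then be handed to whatever single-message parser you already have.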
Also see Wikipedia's entry on the topic.

If you can stretch to using Python, there is one in the standard library. I'm unable to find any for .NET sadly.

To read .mbox files, you can use a third-party library Aspose.Email.
This library is a complete set of Email Processing APIs to build cross-platform applications having the ability to create, manipulate, convert, and transmit emails without using Microsoft Outlook.
Please, take a look at the example I have provided below.
using (FileStream stream = new FileStream("ExampleMbox.mbox", FileMode.Open, FileAccess.Read))
{
    using (MboxrdStorageReader reader = new MboxrdStorageReader(stream, false))
    {
        // Start reading messages
        MailMessage message = reader.ReadNextMessage();
        // Read all messages in a loop
        while (message != null)
        {
            // Manipulate message - show contents
            Console.WriteLine("Subject: " + message.Subject);
            // Save this message in EML or MSG format
            message.Save(message.Subject + ".eml", SaveOptions.DefaultEml);
            message.Save(message.Subject + ".msg", SaveOptions.DefaultMsgUnicode);
            // Get the next message
            message = reader.ReadNextMessage();
        }
    }
}
It is easy to use. I hope this approach will satisfy you and other searchers.
I am working as a Developer Evangelist at Aspose.

Related

NetworkStream.Length substitute

I am using a networkstream to pass short strings around the network.
Now, on the receiving side I have encountered an issue:
Normally I would do the reading like this
see if data is available at all
get count of data available
read that many bytes into a buffer
convert buffer content to string.
In code that assumes all offered methods work as probably intended, that would look something like this:
NetworkStream stream = someTcpClient.GetStream();
while (!stream.DataAvailable)
    ;
byte[] bufferByte;
stream.Read(bufferByte, 0, stream.Length);
ASCIIEncoding enc = new ASCIIEncoding();
string result = enc.GetString(bufferByte);
However, MSDN says that NetworkStream.Length is not really implemented and will always throw an Exception when called.
Since the incoming data are of varying length I cannot hard-code the count of bytes to expect (which would also be a case of the magic-number antipattern).
Question:
If I cannot get an accurate count of the number of bytes available for reading, then how can I read from the stream properly, without risking all sorts of exceptions within NetworkStream.Read?
EDIT:
Although the provided answer leads to a better overall code I still want to share another option that I came across:
TcpClient.Available gives the number of bytes available to read. I knew there had to be a way to count the bytes in one's own inbox.

There's no guarantee that calls to Read on one side of the connection will match up 1-1 with calls to Write from the other side. If you're dealing with variable length messages, it's up to you to provide the receiving side with this information.
One common way to do this is to first work out the length of the message you're going to send and then send that length information first. On the receiving side, you then obtain the length first and then you know how big a buffer to allocate. You then call Read in a loop until you've read the correct number of bytes. Note that, in your original code, you're currently ignoring the return value from Read, which tells you how many bytes were actually read. In a single call and return, this could be as low as 1, even if you're asking for more than 1 byte.
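A minimal sketch of the length-prefix approach might look like the following. The names (Framing, ReadFully, WriteMessage, ReadMessage) and the 4-byte big-endian header are assumptions chosen for illustration; any header layout works as long as both sides agree on it.

```csharp
using System;
using System.IO;
using System.Text;

public static class Framing
{
    // Reads exactly count bytes, looping because a single Read may return fewer.
    public static byte[] ReadFully(Stream stream, int count)
    {
        var buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0)
                throw new EndOfStreamException("Stream ended in the middle of a message.");
            offset += read;
        }
        return buffer;
    }

    // Sends a 4-byte big-endian length, then the payload itself.
    public static void WriteMessage(Stream stream, string text)
    {
        byte[] payload = Encoding.ASCII.GetBytes(text);
        byte[] header = {
            (byte)(payload.Length >> 24), (byte)(payload.Length >> 16),
            (byte)(payload.Length >> 8), (byte)payload.Length
        };
        stream.Write(header, 0, header.Length);
        stream.Write(payload, 0, payload.Length);
    }

    // Reads the length first, then exactly that many payload bytes.
    public static string ReadMessage(Stream stream)
    {
        byte[] header = ReadFully(stream, 4);
        int length = (header[0] << 24) | (header[1] << 16) | (header[2] << 8) | header[3];
        if (length < 0)
            throw new InvalidDataException("Negative message length.");
        return Encoding.ASCII.GetString(ReadFully(stream, length));
    }
}
```

The same methods work on a NetworkStream, since they only rely on the base Stream API.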
Another common way is to decide on message "formats" - where e.g. message number 1 is always 32 bytes in length and has X structure, and message number 2 is 51 bytes in length and has Y structure. With this approach, rather than you sending the message length before sending the message, you send the format information instead - first you send "here comes a message of type 1" and then you send the message.
A further common way, if applicable, is to use some form of sentinels - if your messages will never contain, say, a byte with value 0xff then you scan the received bytes until you've received an 0xff byte, and then everything before that byte was the message you wanted to receive.
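The sentinel approach could be sketched like this. SentinelFraming and ReadUntil are hypothetical names, and the choice to return null when the stream ends before a sentinel arrives (discarding the partial message) is an assumption; you may prefer to throw instead.

```csharp
using System;
using System.IO;

public static class SentinelFraming
{
    // Returns the bytes that precede the next sentinel byte,
    // or null if the stream ends before a sentinel is seen.
    public static byte[] ReadUntil(Stream stream, byte sentinel)
    {
        var message = new MemoryStream();
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            if (b == sentinel)
                return message.ToArray();
            message.WriteByte((byte)b);
        }
        return null; // incomplete message: no sentinel before end of stream
    }
}
```

ReadByte is called once per byte here, so for a network stream it is probably worth wrapping the stream in a BufferedStream first.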
But whatever you want to do, whether it's one of the above approaches or something else, it's up to you to have your sending and receiving sides work together to allow the receiver to discover each message.
I forgot to say: a further way to change everything around, if you want to exchange messages and don't want to do any of the above fiddling around, is to switch to something that works at a higher level (e.g. WCF or HTTP), where the system already takes care of message framing and you can just concentrate on what to do with your messages.

You could use a StreamReader to read the stream to the end:
var streamReader = new StreamReader(someTcpClient.GetStream(), Encoding.ASCII);
string result = streamReader.ReadToEnd();

How to read bytes from SerialPort.BaseStream without Length

I want to use the Stream class to read/write data to/from a serial port. I use BaseStream to get the stream (link below), but the Length property doesn't work. Does anyone know how I can read the full buffer without knowing how many bytes there are?
http://msdn.microsoft.com/en-us/library/system.io.ports.serialport.basestream.aspx

You can't. That is, you can't guarantee that you've received everything if all you have is the BaseStream.
There are two ways you can know if you've received everything:
Send a length word as the first 2 or 4 bytes of the packet. That says how many bytes will follow. Your reader then reads that length word, reads that many bytes, and knows it's done.
Agree on a record separator. That works great for text. For example, you might decide that a null byte or an end-of-line character signals the end of the data. This is somewhat more difficult to do with arbitrary binary data, but possible. See comment.
Or, depending on your application, you can do some kind of timing. That is, if you haven't received anything new for X number of seconds (or milliseconds?), you assume that you've received everything. That has the obvious drawback of not working well if the sender is especially slow.

Maybe you can try SerialPort.BytesToRead property.

Fast text parser for server-client messages - MMO

Application
I am working on an MMO and have run into an issue. The MMO server I've built is fast and, running locally, sends messages to my game client every ~50 milliseconds over UDP sockets. Here is an example of a message from server to client in my current message system:
count={2}body={[t={agt}id={42231}pos={[50.40142117456183,146.3123192153775]}rot={200.0},t={agt}id={4946}pos={[65.83051652925558,495.25839757504866]}rot={187.0}}
count={2}, 2 = number of objects
[,] = array of objects
I built a simple text parser, code: http://tinypaste.com/af3fb928
I use the message like this:
int objects = int.Parse(UTL.Parser.DecodeMessage("count", message));
string body = UTL.Parser.DecodeMessage("body", message);
for (int i = 0; i < objects; i++)
{
    string objectStr = UTL.Parser.DecodeMessage("[" + i + "]", body);
    // parse objectStr with UTL.Parser.DecodeMessage to extract pos & rot and apply to objects
}
Issue
When I have more objects (~60+), the performance decreases dramatically.
Question
What is the standard method for packaging and reading messages between clients and server in MMOs or real-time online games?

A binary protocol would be a lot faster here. Take the coordinates that you pass in for example. When you transport those as bytes, they will take up 8 bytes per axis, whereas the string representation uses 2 bytes per character (unless you transport as ASCII but even then a binary double would be smaller).
Then actually turning that string into a number is a lot more work; first a substring has to be created, incurring garbage collection overhead later on, the number has to be parsed which isn't fast (but to be fair, you can still parse hundreds of thousands of doubles a second) because .NET's double.Parse is very general and has to accommodate a lot of different formats. You would gain noticeable speed by writing your own double parser in case you stick with text, but like I said, you ought to go with binary if message parsing is a bottleneck. Turning bytes into doubles (with a little bit of unsafe magic, which should be fine if you're running a server and should have Full Trust mode) is a matter of copying 8 bytes.
If your messages are mostly just coordinates and such, I think you could gain a lot by creating your own binary format.
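As a rough illustration of such a format, here is a sketch that packs the fields from the example message (id, position, rotation) with BinaryWriter and reads them back with BinaryReader, instead of the unsafe copying mentioned above. The AgentState and AgentCodec names and the field layout (a 2-byte count, then 24 bytes per agent) are assumptions made for this example, not an established protocol.

```csharp
using System;
using System.IO;

public struct AgentState
{
    public int Id;         // 4 bytes
    public double X, Y;    // 8 bytes each
    public float Rotation; // 4 bytes
}

public static class AgentCodec
{
    // Layout: ushort count, then Id, X, Y, Rotation for each agent.
    public static byte[] Encode(AgentState[] agents)
    {
        using (var ms = new MemoryStream())
        using (var writer = new BinaryWriter(ms))
        {
            writer.Write((ushort)agents.Length);
            foreach (AgentState a in agents)
            {
                writer.Write(a.Id);
                writer.Write(a.X);
                writer.Write(a.Y);
                writer.Write(a.Rotation);
            }
            writer.Flush();
            return ms.ToArray();
        }
    }

    public static AgentState[] Decode(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            var agents = new AgentState[reader.ReadUInt16()];
            for (int i = 0; i < agents.Length; i++)
            {
                agents[i].Id = reader.ReadInt32();
                agents[i].X = reader.ReadDouble();
                agents[i].Y = reader.ReadDouble();
                agents[i].Rotation = reader.ReadSingle();
            }
            return agents;
        }
    }
}
```

With this layout the two-agent message from the question takes 50 bytes, and decoding involves no string parsing at all.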

One of your problems might be unnecessary string instantiation. Any time you do a concatenation, split, or get a substring, you are creating new string instances on the heap, which are copied from the original string and need to be garbage collected afterwards.
Try to change your code to iterate through the string character by character, and parse the data using the original string only. You should only use indexing and IndexOf, and possibly even write your own int and float parsers which accept a string offset + length to avoid creating substrings at all. I am not sure if this is overkill, but it's not "premature" optimization if you have hard evidence that the current code is slow.
Also, did you try Protocol Buffers? I believe their performance should be pretty good (just write a small console app as a benchmark). Or try JSON, which is a standard, concise format (but I have no clue how optimized Json.NET is). Nothing should usually beat a hard-coded specialized parser in terms of performance, but with future maintenance in mind, I would try one of these protocols before anything else.
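A sketch of that idea, assuming the key={value} format from the question: SliceParser, TryFindValue and ParseInt are hypothetical helpers that work on offsets into the original string and never allocate a substring. TryFindValue only handles flat values (it stops at the first '}'), so nested values such as body would need extra bracket counting.

```csharp
using System;

public static class SliceParser
{
    // Locates "key={...}" at or after index 'from' and reports where the value
    // between the braces starts and how long it is, without allocating.
    public static bool TryFindValue(string message, string key, int from,
                                    out int start, out int length)
    {
        start = 0;
        length = 0;
        int idx = from;
        while ((idx = message.IndexOf(key, idx, StringComparison.Ordinal)) >= 0)
        {
            int after = idx + key.Length;
            // The key must not be the tail of a longer name (e.g. "t" in "count").
            bool atBoundary = idx == 0 || !char.IsLetterOrDigit(message[idx - 1]);
            if (atBoundary && after + 1 < message.Length
                && message[after] == '=' && message[after + 1] == '{')
            {
                int end = message.IndexOf('}', after + 2);
                if (end < 0)
                    return false;
                start = after + 2;
                length = end - start;
                return true;
            }
            idx++;
        }
        return false;
    }

    // Parses a non-negative decimal integer from s[start .. start + length).
    public static int ParseInt(string s, int start, int length)
    {
        if (length <= 0)
            throw new FormatException("Empty number.");
        int value = 0;
        for (int i = start; i < start + length; i++)
        {
            int digit = s[i] - '0';
            if (digit < 0 || digit > 9)
                throw new FormatException("Not a digit: " + s[i]);
            value = checked(value * 10 + digit);
        }
        return value;
    }
}
```

A double parser can follow the same offset + length pattern if profiling shows that number parsing is the hot spot.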

I'd look into using an RPC framework like Thrift to do the lifting for you. It'll do the packing and parsing for you, and send stuff over the wire in binary so it's more efficient.
There are a bunch of other options, too. Here is a comparison of some.

how to encode data for iso 8583 to transfer socket c#

I don't understand exactly how to send data over c# socket.send( byte[]),
I mean they say I need to send 0800 (Network Management Request) for an echo test, how to convert.
Please I've been programming for a while but I don't understand the instructions.
Thanks

First of all you need to have an understanding of the ISO8583 message format. For echo test messages, in the 87 revision, your MTID should be 0800 and field 70, the Network Management Information code, should be set to 301, indicating echo test.
Building an ISO message is quite tricky. (Shameless plug coming up.) I have released OpenIso8583.Net, a message parser/builder for .NET which can be extended into the particular flavor of ISO you are using.

You should first understand the spec that you're working to; I expect you have something more specific than the bare ISO8583 message spec, something that is specific about the fields required and the content. The important thing is the way you build and deblock the ISO8583 fields to and from the message based on the bitmap that specifies which fields are present.
When I've built ISO8583 test clients in C# in the past I first put together a set of classes that could build and deblock a message bitmap. Once you have that you need some code to build and deblock your messages. These will set (or test) bits in the bitmap and then extract or insert the expected fields into a byte buffer.
Once you have this working the actual sending and receiving of the byte buffer messages is trivial.
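To make the bitmap part concrete, here is a minimal sketch of a class that builds and tests one. Iso8583Bitmap is a made-up name, and the sketch assumes the common convention that field n maps to bit n counted from the most significant bit, with bit 1 flagging a secondary bitmap for fields 65-128. Whether the bitmap goes on the wire as raw bytes or as hex characters depends on the interface spec you are working to.

```csharp
using System;
using System.Collections.Generic;

public static class Iso8583Bitmap
{
    // Builds an 8-byte primary bitmap, or 16 bytes (primary + secondary)
    // when any field above 64 is present.
    public static byte[] Build(IEnumerable<int> fields)
    {
        var bitmap = new byte[16];
        bool needsSecondary = false;
        foreach (int field in fields)
        {
            if (field < 2 || field > 128)
                throw new ArgumentOutOfRangeException("fields", "Field numbers must be 2..128.");
            SetBit(bitmap, field);
            if (field > 64)
                needsSecondary = true;
        }
        if (!needsSecondary)
        {
            var primary = new byte[8];
            Array.Copy(bitmap, primary, 8);
            return primary;
        }
        SetBit(bitmap, 1); // bit 1 announces the secondary bitmap
        return bitmap;
    }

    // Tests whether a field is flagged as present in a received bitmap.
    public static bool IsSet(byte[] bitmap, int field)
    {
        int index = (field - 1) / 8;
        if (field < 1 || index >= bitmap.Length)
            return false;
        return (bitmap[index] & (0x80 >> ((field - 1) % 8))) != 0;
    }

    private static void SetBit(byte[] bitmap, int field)
    {
        bitmap[(field - 1) / 8] |= (byte)(0x80 >> ((field - 1) % 8));
    }
}
```

For example, Build(new[] { 7, 11, 70 }) yields 82 20 00 00 00 00 00 00 04 00 00 00 00 00 00 00 in hex.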

Looking at the spec, it would be impossible to provide a full answer here - but to get you started, you basically need to create the various messages and send them down the pipe. As the Socket class requires an array of bytes rather than a string, you can use one of the Encoding classes to get at the raw bytes of a string. If I am reading the info correctly from Wikipedia:
byte[] echo = System.Text.Encoding.ASCII.GetBytes("0800");
socket.Send(echo);
Disclaimer: I have never had to implement ISO 8583 and the reference I looked at wasn't clear if the codes were in fact simple ASCII characters (though I am betting they are). Hopefully someone more familiar with the standard will clarify or confirm that assumption.

//***************additional encoders***************
UnicodeEncoding encoderUnicode = new UnicodeEncoding();
UTF32Encoding encoder32 = new UTF32Encoding();
UTF7Encoding encoder7 = new UTF7Encoding();
UTF8Encoding encoder8 = new UTF8Encoding();
//*************end of additionals*************
ASCIIEncoding encoder = new ASCIIEncoding();
As for the parser, search for ISO 8583, preferably the 2003 revision.

How to write a file format handler

Today I'm cutting video at work (yea me!), and I came across a strange video format: a MOD file with a companion MOI file.
I found this article online from the wiki, and I wanted to write a file format handler, but I'm not sure how to begin.
I want to write a file format handler to read the information files, has anyone ever done this and how would I begin?
Edit:
Thanks for all the suggestions, I'm going to attempt this tonight, and I'll let you know. The MOI files are not very large, maybe 5KB in size at most (I don't have them in front of me).

You're in luck in that the MOI format at least spells out the file definition. All you need to do is read in the file and interpret the results based on the file definition.
Following the definition, you should be able to create a class that could read and interpret a file which returns all of the file format definitions as properties in their respective types.
Reading the file requires opening the file and generally reading it on a byte-by-byte progression, such as:
using (FileStream fs = File.OpenRead(pathToYourFile)) {
    while (true) {
        int b = fs.ReadByte();
        if (b == -1) {
            break;
        }
        // Interpret byte or bytes here...
    }
}
Per the wiki article's referenced PDF, it looks like someone already reverse engineered the format. From the PDF, here's the first entry in the format:
Hex-Address: 0x00
Data Type: 2 Byte ASCII
Value (Hex): "V6"
Meaning: Version
So, a simplistic implementation could pull the first 2 bytes of data from the file stream and convert to ASCII, which would provide a property value for the Version.
Next entry in the format definition:
Hex-Address: 0x02
Data Type: 4 Byte Unsigned Integer
Value (Hex):
Meaning: Total size of MOI-file
Interpreting the next 4 bytes and converting to an unsigned int would provide a property value for the MOI file size.
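Those two entries could be read with something like the sketch below. MoiHeader is a hypothetical class covering only the first two fields, and the byte order of the size field is an assumption (big-endian here); check it against the format document and a real file before relying on it.

```csharp
using System;
using System.IO;
using System.Text;

public sealed class MoiHeader
{
    public string Version { get; private set; } // offset 0x00, 2-byte ASCII
    public uint FileSize { get; private set; }  // offset 0x02, 4-byte unsigned integer

    public static MoiHeader Read(Stream stream)
    {
        // leaveOpen: true so that the caller stays in control of the stream.
        using (var reader = new BinaryReader(stream, Encoding.ASCII, true))
        {
            byte[] raw = reader.ReadBytes(6);
            if (raw.Length < 6)
                throw new EndOfStreamException("File is too short to hold a MOI header.");
            return new MoiHeader
            {
                Version = Encoding.ASCII.GetString(raw, 0, 2),
                // Assumption: the size is stored big-endian (most significant byte first).
                FileSize = ((uint)raw[2] << 24) | ((uint)raw[3] << 16)
                         | ((uint)raw[4] << 8) | raw[5]
            };
        }
    }
}
```

Further fields from the definition can be added the same way, one property per entry.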
Hope this helps.

If the files are very large and just need to be streamed in, I would create a new reader object that uses an UnmanagedMemoryStream to read the information in.
I've done a lot of different file format processing like this. More recently, I've taken to making a lot of my readers more functional, where reading tends to use 'yield return' to return read-only objects from the file.
However, it all depends on what you want to do. If you are trying to create a general-purpose format for use in other applications or create an API, you probably want to conform to an existing standard. If, however, you just want to get data into your own application, you are free to do it however you want. You could use a BinaryReader on the stream and construct the information you need within your app, or get the reader to return objects representing the contents of the file.
The one thing I would recommend: make sure it implements IDisposable and that you wrap it in a using!
