I just rewrote some code so that it should now be able to process large MIME files meant for printing.
For this we used MimeKit in the following manner:
var message = MimeMessage.Load(_fileStream, true);
var iter = new MimeIterator(message);

while (iter.MoveNext())
{
    ...
}
What is interesting here is that before I used _fileStream, we used to copy the InputStream of the HttpListenerRequest object, which is a NetworkStream (non-seekable) that MimeKit cannot load directly.
Since switching to _fileStream, for a 2GB job I spend around 40 seconds inside the while loop iterating over each part of the request, which is a lot if you ask me... with the MemoryStream it was 6 seconds.
Is this behavior normal? Is there a chance to get better times with MimeKit, or should I implement my own parser?
Thanks in advance!
What is interesting here is that before I used _fileStream, we used to copy the InputStream of the HttpListenerRequest object, which is a NetworkStream (non-seekable) that MimeKit cannot load directly.
I'm not quite sure what you mean above. Were you doing something like this?
var memory = new MemoryStream ();
httpResponse.Stream.CopyTo (memory);
memory.Position = 0;
var message = MimeMessage.Load(memory, true);
Since switching to _fileStream, for a 2GB job I spend around 40 seconds inside the while loop iterating over each part of the request, which is a lot if you ask me... with the MemoryStream it was 6 seconds.
Your use of MimeMessage.Load() passes true as the persistent argument. This improves parsing performance (because it no longer needs to load MIME content into RAM as it parses), but it negatively impacts the performance of reading the content of individual MIME parts later, because it needs to seek within _fileStream to the starting offset of each MIME part's content as you request it.
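For contrast, a minimal sketch of the non-persistent form (my illustration of the trade-off described above, not code from the question):

// Default (persistent: false): MimeKit buffers each part's content in memory while
// parsing, so later content reads never need to seek back into _fileStream --
// faster iteration afterwards, but for a 2GB job that content ends up in RAM.
var message = MimeMessage.Load(_fileStream);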
Is this behavior normal?
40 seconds does seem like a lot, but without knowing more details of your code, I can't make any suggestions.
Is there a chance to get better times with MimeKit or should I implement my own parser?
Well, I can pretty much guarantee that if you implement your own parser, it will very likely be orders of magnitude slower than MimeKit (seeing as how literally every other attempt to write a MIME parser in C# is orders of magnitude slower than MimeKit) ;-)
What you need to do here is run your code under a profiler to see where the problem is.
If the performance problem is, as you suggest, inside of the loop, it is NOT a performance problem in the parser; it is a performance problem within your loop. That's not to say that MimeKit code doesn't play a role in the slowness.
If the problem is in MimeKit, I would suspect this logic inside of BoundStream.Read():
// make sure that the source stream is in the expected position
if (BaseStream.Position != StartBoundary + position)
    BaseStream.Seek (StartBoundary + position, SeekOrigin.Begin);
In your case, BaseStream should be your _fileStream. It's been a while since I've looked at the code for FileStream.Position, but it may require a syscall to get that value (and syscalls are not free).
40 seconds, though, is a LOT of time - way more than I would expect unless each MIME part's content is massive (streaming the content of a MimePart.Content will read ~4K per loop, and if there is 1 lseek() per loop, that can add up).
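One thing that might be worth experimenting with (my own idea, not something suggested in the thread) is opening the file with a larger FileStream buffer, so the ~4K reads and position checks hit the OS less often. A rough sketch, where path is a stand-in for your file's location:

// Hypothetical: a 1MB internal buffer means most 4K reads are served from memory
// rather than triggering a syscall per read.
var fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 1 << 20);
var message = MimeMessage.Load(fileStream, true);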
Hope that helps you figure this out.
I am researching the possibility of using pipelines for processing binary messages coming from the network.
The binary messages I will be processing come with a payload, and it is desirable to keep the payload in its binary form.
The idea is to read out the whole message and create a slice of the message and its payload. Once the message is completely read, it will be passed to a channel chain for processing; the processing will not be instant and might take some time or be executed later, and the goal is not to have the pipe reader wait until the processing is complete. Then, once the message processing is complete, I would need to release the processed buffer region back to the pipe writer.
Now of course I could just create a new byte array and copy the data coming from the pipe writer, but that would defeat the purpose of no-copy. So, as I understand it, I would need some buffer synchronization between the pipeline and the channel?
I looked at the available APIs of the pipe reader (AdvanceTo), where it's possible to tell the pipe reader what was consumed and what was examined, but I can't work out how this could be synced outside of the pipe-reading method.
So the question is whether there are techniques or examples of how this can be achieved.
The buffer obtained from TryRead/ReadAsync is only valid until you call AdvanceTo, with the expectation that as soon as you've done that: anything you reported as consumed is available to be recycled for use elsewhere (which could be parallel/concurrent readers). Strictly speaking: even the bits you haven't reported as consumed: you still shouldn't treat as valid once you've called AdvanceTo (although in reality, it is likely that they'll still be the same segments - just: that isn't the concern of the caller; to the caller, it is only valid between the read and the advance).
This means that you explicitly can't do:
while (true)
{
    var result = await reader.ReadAsync();
    var buffer = result.Buffer;

    if (TryIdentifyFrameBoundary(buffer, out var frame))
    {
        BeginProcessingInBackground(frame); // <==== THIS IS A PROBLEM!
        reader.AdvanceTo(frame.End, frame.End);
    }
    else // take nothing
    {
        reader.AdvanceTo(buffer.Start, buffer.End);
        if (result.IsCompleted) break; // that's all folks
    }
}
because the "in background" bit, when it fires, could now be reading someone else's data (due to it being reused already).
So: either you need to process the frame contents as part of the read loop, or you're going to have to make a copy of the data, most likely by using:
var len = checked ((int)buffer.Length);
var oversized = ArrayPool<byte>.Shared.Rent(len);
buffer.CopyTo(oversized);
and pass oversized to your background processing, remembering to only look at the first len bytes of it. You could pass this as a ReadOnlyMemory<byte>, but you need to consider that you're also going to want to return it to the array-pool afterwards (probably in a finally block), and passing it as a memory makes it a little more awkward (but not impossible, thanks to MemoryMarshal.TryGetArray).
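Putting that together, a minimal sketch of the rent-copy-return pattern (ProcessFrameAsync is a made-up placeholder for whatever your channel chain does, and the frame parameter stands in for the slice you identified in the read loop; requires System.Buffers, System.IO.Pipelines and System.Threading.Tasks):

// 'frame' is the ReadOnlySequence<byte> identified inside the read loop.
static void DispatchFrame(PipeReader reader, ReadOnlySequence<byte> frame)
{
    var len = checked((int)frame.Length);
    var oversized = ArrayPool<byte>.Shared.Rent(len);
    frame.CopyTo(oversized);                                // copy before AdvanceTo invalidates it
    var payload = new ReadOnlyMemory<byte>(oversized, 0, len);

    _ = Task.Run(async () =>
    {
        try
        {
            await ProcessFrameAsync(payload);               // hypothetical downstream processing
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(oversized);       // we own the copy, so we return it
        }
    });

    reader.AdvanceTo(frame.End);                            // the pipe may now recycle its buffers
}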
Note: in early versions of the pipelines API, there was an element of reference-counting, which did allow you to preserve buffers, but it had a few problems:
it complicated the API hugely
it led to leaked buffers
it was ambiguous and confusing what "preserved" meant; is the count until it gets reused? or released completely?
so that feature was dropped.
I need to process an XML file with the following structure:
<FolderSizes>
  <Version></Version>
  <DateTime Un=""></DateTime>
  <Summary>
    <TotalSize Bytes=""></TotalSize>
    <TotalAllocated Bytes=""></TotalAllocated>
    <TotalAvgFileSize Bytes=""></TotalAvgFileSize>
    <TotalFolders Un=""></TotalFolders>
    <TotalFiles Un=""></TotalFiles>
  </Summary>
  <DiskSpaceInfo>
    <Drive Type="" Total="" TotalBytes="" Free="" FreeBytes="" Used=""
           UsedBytes=""><![CDATA[ ]]></Drive>
  </DiskSpaceInfo>
  <Folder ScanState="">
    <FullPath Name=""><![CDATA[ ]]></FullPath>
    <Attribs Int=""></Attribs>
    <Size Bytes=""></Size>
    <Allocated Bytes=""></Allocated>
    <AvgFileSz Bytes=""></AvgFileSz>
    <Folders Un=""></Folders>
    <Files Un=""></Files>
    <Depth Un=""></Depth>
    <Created Un=""></Created>
    <Accessed Un=""></Accessed>
    <LastMod Un=""></LastMod>
    <CreatedCalc Un=""></CreatedCalc>
    <AccessedCalc Un=""></AccessedCalc>
    <LastModCalc Un=""></LastModCalc>
    <Perc><![CDATA[ ]]></Perc>
    <Owner><![CDATA[ ]]></Owner>
    <!-- Special element; see paragraph below -->
    <Folder></Folder>
  </Folder>
</FolderSizes>
The <Folder> element is special in that it repeats within the <FolderSizes> element but can also appear within itself; I reckon up to about 5 levels.
The problem is that the file is really big at a whopping 11GB so I'm having difficulty processing it - I have experience with XML documents, but nothing on this scale.
What I would like to do is to import the information into a SQL database because then I will be able to process the information in any way necessary without having to concern myself with this immense, impractical file.
Here are the things I have tried:
Simply load the file and attempt to process it with a simple C# program using an XmlDocument or XDocument object
Before I even started I knew this would not work, as I'm sure everyone would agree, but I tried it anyway, and ran the application on a VM (since my notebook only has 4GB RAM) with 30GB memory. The application ended up using 24GB memory, and taking very, very long, so I just cancelled it.
Attempt to process the file using an XmlReader object
This approach worked better in that it didn't use as much memory, but I still had a few problems:
It was taking really long because I was reading the file one line at a time.
Processing the file one line at a time makes it difficult to really work with the data contained in the XML, because now you have to detect the start of a tag, then the end of that tag (hopefully), then create a document from that information, read the info, and attempt to determine which parent tag it belongs to because we have multiple levels... Sounds prone to problems and errors.
Did I mention it takes really long to read the file one line at a time? And that is still without actually processing the line - literally just reading it.
Import the information using SQL Server
I created a stored procedure using XQuery and ran it recursively within itself to process the <Folder> elements. This went quite well - I think better than the other two approaches - until one of the <Folder> elements ended up being rather big, producing an "An XML operation resulted an XML data type exceeding 2GB in size. Operation aborted." error. I read up about it and I don't think it's an adjustable limit.
Here are more things I think I should try:
Re-write my C# application to use unmanaged code
I don't have much experience with unmanaged code, so I'm not sure how well it will work and how to make it as unmanaged as possible.
I once wrote a little application that works with my webcam, receiving the image, inverting the colours, and painting it to a panel. Using normal managed code didn't work - the result was about 2 frames per second. Re-writing the colour inversion method to use unmanaged code solved the problem. That's why I thought that unmanaged might be a solution.
Rather go for C++ instead of C#
Not sure if this is really a solution. Would it necessarily be better than C#? Better than unmanaged C#?
The problem here is that I haven't actually worked with C++ before, so I'll need to get to know a few things about C++ before I can really start working with it, and then probably not very efficiently yet.
I thought I'd ask for some advice before I go any further, possibly wasting my time.
Thanks in advance for your time and assistance.
EDIT
So before I start processing the file, I run through it and check the size in an attempt to provide the user with feedback as to how long the processing might take; I made a screenshot of the calculation:
That's about 1,500 lines per second; if the average line length is about 50 characters, that's 50 bytes per line, which is 75 kilobytes per second, so an 11GB file should take about 40 hours, if my maths is correct. But this is only stepping through each line, not actually processing it or doing anything with it, so when that starts, the processing rate drops significantly.
This is the method that runs during the size calculation:
private int _totalLines = 0;
private bool _cancel = false; // set to true when the cancel button is clicked

private void CalculateFileSize()
{
    using (var xmlStream = new StreamReader(_filePath))
    using (var xmlReader = new XmlTextReader(xmlStream))
    {
        // Step through every node just to discover the total line count,
        // updating the UI (from the worker thread) as we go.
        while (xmlReader.Read())
        {
            if (_cancel)
                return;

            if (xmlReader.LineNumber > _totalLines)
                _totalLines = xmlReader.LineNumber;

            InterThreadHelper.ChangeText(
                lblLinesRemaining,
                string.Format("{0} lines", _totalLines));

            string elapsed = string.Format(
                "{0}:{1}:{2}:{3}",
                timer.Elapsed.Days.ToString().PadLeft(2, '0'),
                timer.Elapsed.Hours.ToString().PadLeft(2, '0'),
                timer.Elapsed.Minutes.ToString().PadLeft(2, '0'),
                timer.Elapsed.Seconds.ToString().PadLeft(2, '0'));
            InterThreadHelper.ChangeText(lblElapsed, elapsed);

            if (_cancel)
                return;
        }
    }
}
Still running, 27 minutes in :(
You can read XML as a logical stream of elements instead of trying to read it line by line and piece it back together yourself; see the code sample at the end of this article.
Also, your question has already been asked here.
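For illustration, here's a minimal sketch of that element-streaming idea, using XmlReader.ReadSubtree (requires System.Xml and System.Xml.Linq; SaveFolderToDatabase is a made-up placeholder for whatever SQL insert you end up writing). Each top-level <Folder> is loaded as one in-memory element, nested <Folder> children included, so this only helps if no single folder subtree is itself enormous:

using (var reader = XmlReader.Create(_filePath))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "Folder")
        {
            // Load just this <Folder> element (and its children); the rest of the
            // 11GB file is never held in memory at once.
            var folder = XElement.Load(reader.ReadSubtree());
            SaveFolderToDatabase(folder);   // hypothetical: map attributes/children to SQL rows
        }
    }
}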
I have a socket-based application that exposes received data with a BinaryReader object on the client side. I've been trying to debug an issue where the data contained in the reader is not clean... i.e. the buffer that I'm reading contains old data past the size of the new data.
In the code below:
System.Diagnostics.Debug.WriteLine("Stream length: {0}", _binaryReader.BaseStream.Length);
byte[] buffer = _binaryReader.ReadBytes((int)_binaryReader.BaseStream.Length);
When I comment out the first line, the data doesn't end up being dirty (or at least not as regularly) as when I have that print statement. As far as I can tell, the data is coming in cleanly from the server side, so it's possible that my socket implementation has some issues. But does anyone have any idea why adding that print line would cause the data to be dirty more often?
Your binary reader looks like it is a private member variable (if the leading underscore is a telltale sign).
Is your application multithreaded? You could be experiencing a race condition if another thread is also attempting to use your _binaryReader while you are reading from it. The fact that you experience issues even without that line seems quite suspect to me.
Are you sure that your reading logic is correct? Stream.Length indicates the length of the entire stream, not of the remaining data to be read.
Suppose that, initially, 100 bytes were available. Length is 100, and the BinaryReader correctly reads 100 bytes and advances the stream position by 100. Then another 20 bytes arrive. Length is now 120; however, your BinaryReader should only be reading 20 bytes, not 120. The ‘extra’ 100 bytes requested in the second read would either cause it to block or (if the stream is not implemented correctly) break.
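To illustrate (my sketch, not the poster's code), and assuming the underlying stream supports Position and Length the way the snippet above already relies on, the read would need to look more like this:

// Only read the bytes that arrived since the last read, not the whole stream length.
long alreadyConsumed = _binaryReader.BaseStream.Position;
long newBytes = _binaryReader.BaseStream.Length - alreadyConsumed;
byte[] buffer = _binaryReader.ReadBytes((int)newBytes);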
The problem was silly and unrelated. I believe my reading logic above is correct, however. The issue was that the _binaryReader I was using was a reference that was not owned by my class and hence the underlying stream was being rewritten with bad data.
I've been blocked by this problem the entire day, read thousands of Google results, but nothing seems to reflect my problem or even come near to it... I hope one of you can give me a push in the right direction.
I wrote a client-server application (so, more like two applications) - the client collects data about its system, as well as a screenshot, serializes all of this into an XML stream (the picture as a byte[] array) and sends it to the server at regular intervals.
The server receives the stream (via TCP), deserializes the XML into an information object and shows the information on a Windows Form.
This process runs stably for about 20-25 minutes at a submission interval of 3 seconds. When observing the memory usage there's nothing significant to see; it's also fairly stable. But after these 20-25 minutes the server throws a StackOverflowException at the point where it deserializes the TCP stream, specifically when setting the Image property from the byte[] array.
I thoroughly searched for recursive or endless loops, and given that it occurs after thousands of successful intervals, I can hardly imagine that's the cause.
public byte[] ImageBase
{
    get
    {
        MemoryStream ms = new MemoryStream();
        _screen.Save(ms, System.Drawing.Imaging.ImageFormat.Jpeg);
        return ms.GetBuffer();
    }
    set
    {
        if (_screen != null) _screen.Dispose(); // preventing well-known image memory leak

        MemoryStream ms = new MemoryStream(value);
        try
        {
            _screen = Image.FromStream(ms); // << EXCEPTION THROWING HERE
        }
        catch (StackOverflowException ex) // thx to new CLR management this wont work anymore -.-
        {
            Console.WriteLine(ex.Message + Environment.NewLine + ex.StackTrace);
        }
        ms.Dispose();
        ms = null;
    }
}
I hope more code isn't necessary, as it could get very complex...
Please help, I have no clue at all anymore.
thx
Chris
I suspect that it's not the code you posted but the code that reads from the TCP stream that's growing the stack. The fact that the straw breaking the camel's back happens during Image.FromStream is probably irrelevant. I've oftentimes seen people write socket processing code containing self-calling code (sometimes indirectly, like A -> B -> A -> B). You should inspect that code and post it here for us to look at.
You might want to read this. Loading an image from a stream without keeping the stream open
It seems possible that streams are being maintained on the stack or in some other object that is eventually blowing the stack.
My suggestion would be to just hold onto the byte[] and wait until the last possible moment to decode and draw it, then dispose of the Image immediately. Your get/set would then just set/get the byte[]. You would then implement a custom drawing routine that decodes the current byte[] and draws it, making sure not to hold onto any more resources than necessary.
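Something along these lines is what I mean - a rough sketch only, with _screenBytes, panel and the Paint handler as assumed names:

private byte[] _screenBytes;    // raw JPEG bytes received from the client

public byte[] ImageBase
{
    get { return _screenBytes; }
    set { _screenBytes = value; }
}

private void panel_Paint(object sender, PaintEventArgs e)
{
    if (_screenBytes == null) return;

    // Decode only for the duration of the paint, then release everything.
    using (var ms = new MemoryStream(_screenBytes))
    using (var img = Image.FromStream(ms))
    {
        e.Graphics.DrawImage(img, panel.ClientRectangle);
    }
}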
Update
If there is a way you can get us a full stack trace, we might be able to help further. I'm beginning to think it isn't the problem like I described. I created a sample program that created 10,000 images just like you do in your setter and there wasn't a problem. If you send an image every 3 seconds, that's 20 images a minute times 20 minutes, which is only 400 images.
I'm very interested in the solution to this. I'll come back to it later.
There are a few possibilities, however remote.
Image.FromStream is trying to process an invalid/corrupt byte[] and that method somehow uses recursion to decode a bitmap. Highly unlikely.
The exception isn't being thrown where you think it is. A full stack trace, if possible would be very helpful. As you stated you cannot catch a StackOverflowException. I believe there are provisions for this if you are running it through the debugger though.
I'm not sure if it's relevant but the MSDN documentation for Image.FromStream states that
You must keep the stream open for the lifetime of the Image.
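In other words, if you do keep decoding in the setter, a common workaround (my sketch, not something from this thread) is to copy the decoded image into a fresh Bitmap so the MemoryStream really can be disposed afterwards:

if (_screen != null) _screen.Dispose();

using (var ms = new MemoryStream(value))
using (var decoded = Image.FromStream(ms))
{
    _screen = new Bitmap(decoded);   // independent copy, no longer tied to 'ms'
}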
I am creating a downloading application and I wish to preallocate room on the hard drive for the files before they are actually downloaded, as they could potentially be rather large, and no one likes to see "This drive is full, please delete some files and try again." So, in that light, I wrote this.
// Quick, and very dirty
System.IO.File.WriteAllBytes(filename, new byte[f.Length]);
It works, at least until you download a file that is several hundred MBs, or potentially even GBs, and you throw Windows into a thrashing frenzy, if not totally wipe out the pagefile and kill your system's memory altogether. Oops.
So, with a little more enlightenment, I set out with the following algorithm.
using (FileStream outFile = System.IO.File.Create(filename))
{
    // 4194304 = 4MB; write full blocks first, then whatever remains
    byte[] buff = new byte[4194304];

    for (long i = 0; i + buff.Length <= f.Length; i += buff.Length)
    {
        outFile.Write(buff, 0, buff.Length);
    }

    outFile.Write(buff, 0, (int)(f.Length % buff.Length));
}
This works, even well, and doesn't suffer the crippling memory problem of the last solution. It's still slow though, especially on older hardware, since it writes out (potentially GBs' worth of) data to the disk.
The question is this: Is there a better way of accomplishing the same thing? Is there a way of telling Windows to create a file of x size and simply allocate the space on the filesystem rather than actually writing out a tonne of data? I don't care about initialising the data in the file at all (the protocol I'm using - BitTorrent - provides hashes for the files it sends, hence the worst case for random uninitialised data is that I get a lucky coincidence and part of the file is correct).
FileStream.SetLength is the one you want. The syntax:
public override void SetLength(
long value
)
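As a minimal sketch (reusing the filename and f.Length from the question), preallocating the file then looks like this:

using (FileStream outFile = System.IO.File.Create(filename))
{
    // Extends the file to its final size without writing the data itself.
    outFile.SetLength(f.Length);
}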
If you have to create the file, I think that you can probably do something like this:
using (FileStream outFile = System.IO.File.Create(filename))
{
    outFile.Seek(<length_to_write> - 1, SeekOrigin.Begin);
    outFile.WriteByte(0);
}
Where length_to_write would be the size in bytes of the file to write. I'm not sure that I have the C# syntax correct (not on a computer to test), but I've done similar things in C++ in the past and it's worked.
Unfortunately, you can't really do this just by seeking to the end. That will set the file length to something huge, but may not actually allocate disk blocks for storage. So when you go to write the file, it will still fail.