Virtual temp file, omitting IO operations - C#

Let's say I received a .csv file over the network,
so I have a byte[].
I also have a parser that reads .csv files and does business things with them,
using File.ReadAllLines().
So far I did:
File.WriteAllBytes(tempPath, incomingBuffer);
parser.Open(tempPath);
I won't ever need the actual file on this device, though.
Is there a way to "store" this file in some virtual place and "open" it again from there, but all in memory?
That would save me ages of waiting on the IO operations to complete (there's a good article on that on Coding Horror),
plus reduce wear on the drive (relevant if this occurred a few dozen times a minute, 24/7)
and in general eliminate a point of failure.
This is a bit in the UNIX direction, where everything is a file stream, but we're talking Windows here.

"I won't ever need the actual file on this device, though." - Well, you kind of do if all your APIs expect a file on disk.
You can:
1) Get decent APIs (I am sure there are CSV parsers that take a Stream as a constructor parameter - you could then use a MemoryStream, for example; see the sketch below).
2) If performance is a serious issue and there is no way around the APIs, there's one simple solution: write your own RAM disk implementation, which caches everything that is needed and pages to the HDD if necessary.
http://code.msdn.microsoft.com/windowshardware/RAMDisk-Storage-Driver-9ce5f699 (Oh, did I mention that you absolutely need to have mad experience with drivers :p?)
There are also ready-made RAM disk solutions (Google!), which means you can just run (in your application initializer) something like 'CreateRamDisk.exe -Hdd "__MEMDISK__"' and then use File.WriteAllBytes("__MEMDISK__:\yourFile.csv");
Alternatively, you can read up on memory-mapped files (.NET 4.0 has nice support). However, by the sound of it, that probably does not help you much here.
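
For option 1, a minimal sketch of the in-memory approach, assuming the business logic can be fed one line at a time (ParseLine here is a hypothetical stand-in for whatever the real parser does with each record):

    using System;
    using System.IO;

    static class InMemoryCsv
    {
        // Parse the incoming byte[] entirely in memory: wrap it in a MemoryStream
        // and read it line by line, the same way File.ReadAllLines() would have.
        public static void Parse(byte[] incomingBuffer)
        {
            using (var ms = new MemoryStream(incomingBuffer, false))   // read-only view over the buffer
            using (var reader = new StreamReader(ms))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    ParseLine(line);
                }
            }
        }

        // Stand-in for whatever the real parser/business logic does with a record.
        private static void ParseLine(string line)
        {
            Console.WriteLine(line.Split(',')[0]);
        }
    }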

Related

Way to obtain full directory information in a batch

We’ve got a process that obtains a list of files from a remote transatlantic Samba share. This is naturally on the slow side; however, it’s made worse by the fact that we don’t just need the names of the files, we also need the last write times to spot updates. There are quite a lot of files in the directory, and as far as I can tell, the .NET file API insists on me asking for each one individually. Is there a faster way of obtaining the information we need?
I would love to find a way myself. I have exactly the same problem - a huge number of files on a slow network location, and I need to scan for changes.
As far as I know, you do need to ask for file properties one by one.
The amount of information transferred per file should not be high, though; the round-trip request-response time is probably the main problem. You can help the situation by running multiple requests in parallel (e.g. using Parallel.ForEach); see the sketch below.
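
A rough sketch of that parallel approach (the share path is hypothetical, and the degree of parallelism is just a starting point to tune):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class LastWriteScanner
    {
        static void Main()
        {
            const string share = @"\\remote-server\logs";   // hypothetical UNC path to the remote share

            var lastWrites = new ConcurrentDictionary<string, DateTime>();

            // Enumerate the names once, then fetch per-file metadata in parallel
            // so the network round-trips overlap instead of running sequentially.
            Parallel.ForEach(
                Directory.EnumerateFiles(share),
                new ParallelOptions { MaxDegreeOfParallelism = 8 },
                file => lastWrites[file] = File.GetLastWriteTimeUtc(file));

            foreach (var entry in lastWrites)
                Console.WriteLine("{0}: {1:u}", entry.Key, entry.Value);
        }
    }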
The answer to your question is most likely no, at least not in a meaningful way.
Exactly how you enumerate the files in your code is almost irrelevant since they all boil down to the same file system API in Windows. Unfortunately, there is no function that returns a list of file details in one call*.
So, no matter what your code looks like, somewhere below, it's still enumerating the directory contents and calling a particular file function individually for each file.
If this is really a problem, I would look into moving the detection logic closer to the files and sending your app the results periodically.
*Disclaimer: It's been a long time since I've been this far down the stack and I'm just browsing the API docs now; there may be a new function somewhere that does exactly this.

Efficiently streaming data across process boundaries in .NET

I've been working on an internal developer tool on and off for a few weeks now, but I'm running into an ugly stumbling block I haven't managed to find a good solution for. I'm hoping someone can offer some ideas or guidance on the best ways to use the existing frameworks in .NET.
Background: the purpose of this tool is to load multiple different types of log files (Windows Event Log, IIS, SQL trace, etc.) to the same database table so they can be sorted and examined together. My personal goal is to make the entire thing streamlined so that we only make a single pass and do not cache the entire log either in memory or to disk. This is important when log files reach hundreds of MB or into the GB range. Fast performance is good, but slow and unobtrusive (allowing you to work on something else in the meantime) is better than running faster but monopolizing the system in the process, so I've focused on minimizing RAM and disk usage.
I've iterated through a few different designs so far trying to boil it down to something simple. I want the core of the log parser--the part that has to interact with any outside library or file to actually read the data--to be as simple as possible and conform to a standard interface, so that adding support for a new format is as easy as possible. Currently, the parse method returns an IEnumerable<Item> where Item is a custom struct, and I use yield return to minimize the amount of buffering.
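
A rough sketch of that shape (the names are illustrative, not the tool's actual types): the parser implements a common interface and yields one record at a time, so only the current line is ever buffered.

    using System;
    using System.Collections.Generic;
    using System.IO;

    public struct Item
    {
        public DateTime Timestamp;
        public string Source;
        public string Message;
    }

    public interface ILogParser
    {
        IEnumerable<Item> Parse(string path);
    }

    // Example implementation that streams records with yield return instead of
    // building a list, so memory use stays flat regardless of the log's size.
    public class PlainTextLogParser : ILogParser
    {
        public IEnumerable<Item> Parse(string path)
        {
            using (var reader = new StreamReader(path))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    yield return new Item
                    {
                        Timestamp = DateTime.UtcNow,   // placeholder; a real parser extracts this from the line
                        Source = path,
                        Message = line
                    };
                }
            }
        }
    }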
However, we quickly run into some ugly constraints: the libraries provided (generally by Microsoft) to process these file formats. The biggest and ugliest problem: one of these libraries only works in 64-bit. Another one (Microsoft.SqlServer.Management.Trace TraceFile for SSMS logs) only works in 32-bit. As we all know, you can't mix and match 32- and 64-bit code. Since the entire point of this exercise is to have one utility that can handle any format, we need to have a separate child process (which in this case is handling the 32-bit-only portion).
The end result is that I need the 64-bit main process to start up a 32-bit child, provide it with the information needed to parse the log file, and stream the data back in some way that doesn't require buffering the entire contents to memory or disk. At first I tried using stdout, but that fell apart with any significant amount of data. I've tried using WCF, but it's really not designed to handle the "service" being a child of the "client", and it's difficult to get them synchronized backwards from how they want to work, plus I don't know if I can actually make them stream data correctly. I don't want to use a mechanism that opens up unsecured network ports or that could accidentally crosstalk if someone runs more than one instance (I want that scenario to work normally--each 64-bit main process would spawn and run its own child). Ideally, I want the core of the parser running in the 32-bit child to look the same as the core of a parser running in the 64-bit parent, but I don't know if it's even possible to continue using yield return, even with some wrapper in place to help manage the IPC. Is there any existing framework in .NET that makes this relatively easy?
WCF does have a P2P mode; however, if all your processes are on the local machine you are better off with IPC such as named pipes, since the latter runs in kernel mode and does not have the messaging overhead of the former.
Failing that, you could try COM, which should not have a problem talking between 32- and 64-bit processes. - Tell me more
In case anyone stumbles across this, I'll post the solution that we eventually settled on. The key was to redefine the inter-process WCF service interface to be different from the intra-process IEnumerable interface. Instead of attempting to yield return across process boundaries, we stuck a proxy layer in between that uses an enumerator, so we can call a "give me an item" method over and over again. It's likely this has more performance overhead than a true streaming solution, since there's a method call for every item, but it does seem to get the job done, and it doesn't leak or consume memory.
We did follow Micky's suggestion of using named pipes, but still within WCF. We're also using named semaphores to coordinate the two processes, so we don't attempt to make service calls until the "child service" has finished starting up.
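
Roughly, that ends up looking something like the sketch below (names and details are illustrative, not the actual code): a pull-based service contract hosted in the 32-bit child over a named pipe, which the 64-bit parent calls item by item.

    using System;
    using System.Runtime.Serialization;
    using System.ServiceModel;

    [DataContract]
    public class LogItem
    {
        [DataMember] public DateTime Timestamp { get; set; }
        [DataMember] public string Message { get; set; }
    }

    [ServiceContract]
    public interface ILogItemSource
    {
        [OperationContract]
        void Open(string logFilePath);

        // Pull model: the parent calls this repeatedly; null signals that the
        // child-side enumerator is exhausted.
        [OperationContract]
        LogItem GetNextItem();
    }

    // Parent side: connect to the child over a named pipe (endpoint name is hypothetical).
    public static class ChildParserProxy
    {
        public static ILogItemSource Connect(string pipeName)
        {
            var factory = new ChannelFactory<ILogItemSource>(
                new NetNamedPipeBinding(),
                new EndpointAddress("net.pipe://localhost/" + pipeName));
            return factory.CreateChannel();
        }
    }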

How to make subsequent instances of an assembly share the same memory?

I want something like a static class variable, except that when different applications load my assembly I want them all to share the same variable.
I know I could write to disk or to a database, but this is for a process that's used with SQL queries and that would probably slow it down too much (actually I am going to test these options out, but I'm asking this question in the meantime because I don't think they're going to be an acceptable solution).
I would prefer the solution that incurs the least deployment overhead, and I don't mind if the solution isn't easy to create so long as it's easy to use when I'm done.
I'm aware that there are some persistent memory frameworks out there. I haven't checked any of them out yet and maybe one of them would be perfect, so feel free to recommend one. I am also perfectly content to write something myself, particularly if doing so makes deployment easier.
Thanks in advance for any and all suggestions!
Edit: Looks like I was overlooking a really easy solution. My problem involved SQL Server only providing 8000 bytes of space to serialize data between calls to a SQL aggregate function I wrote. I read an article on how to compress your data and get the most out of those 8000 bytes, and assumed there was nothing more I could do. As it turns out, I can set MaxByteSize = -1 instead of a value in the 0-8000 range and get up to 2 GB of space. I believe this was something new they added in the 3.5 framework, because there are various articles out there talking about this 8000-byte limitation.
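
For reference, a sketch of where that setting lives, assuming a Format.UserDefined aggregate (the aggregate body here is illustrative; only the MaxByteSize = -1 part is the point):

    using System;
    using System.Data.SqlTypes;
    using System.IO;
    using System.Text;
    using Microsoft.SqlServer.Server;

    // MaxByteSize = -1 lifts the 8000-byte cap on the serialized intermediate
    // state (up to 2 GB); with a value in the 1-8000 range the old limit applies.
    [Serializable]
    [SqlUserDefinedAggregate(Format.UserDefined, MaxByteSize = -1)]
    public struct ConcatAggregate : IBinarySerialize
    {
        private StringBuilder _buffer;

        public void Init()                        { _buffer = new StringBuilder(); }
        public void Accumulate(SqlString value)   { if (!value.IsNull) _buffer.Append(value.Value); }
        public void Merge(ConcatAggregate other)  { _buffer.Append(other._buffer.ToString()); }
        public SqlString Terminate()              { return new SqlString(_buffer.ToString()); }

        // IBinarySerialize is required when Format.UserDefined is used.
        public void Read(BinaryReader reader)     { _buffer = new StringBuilder(reader.ReadString()); }
        public void Write(BinaryWriter writer)    { writer.Write(_buffer.ToString()); }
    }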
Thank you all for your answers, though, as this is a problem I've wanted to solve for other reasons in the past, and now I know what to do if I need a really easy and fast way to communicate between apps.
You can't store this as in-memory data and have it shared between processes, since each process has its own isolated memory address space.
One option, however, would be to use the .NET Memory-mapped file support to "store" the shared data. This would allow you to write a file that contained the information in a place that every process could access.
Each process has its own address space. You cannot simply share a variable the way you intend.
You can use shared memory, though.
If you are on .NET 4, you can simply use memory-mapped files.
If you want some sort of machine-wide count or locking, you can look into using named synchronization objects such as semaphores (http://msdn.microsoft.com/en-us/library/z6zx288a.aspx) or mutexes (http://msdn.microsoft.com/en-us/library/hw29w7t1.aspx). When a name is specified, such objects are machine-wide instead of process-wide.
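
A sketch combining the two suggestions above - a named memory-mapped file for the shared value plus a named mutex for machine-wide locking (the names and the 8-byte layout are illustrative):

    using System;
    using System.IO.MemoryMappedFiles;
    using System.Threading;

    static class SharedCounter
    {
        // The map is kept open for the lifetime of the process; a non-persisted
        // named map disappears once every process has closed its handle.
        static readonly MemoryMappedFile Map =
            MemoryMappedFile.CreateOrOpen("MyAssembly.SharedState", 8);
        static readonly Mutex Lock = new Mutex(false, "MyAssembly.SharedState.Lock");

        public static long Increment()
        {
            Lock.WaitOne();
            try
            {
                using (var accessor = Map.CreateViewAccessor(0, 8))
                {
                    long value = accessor.ReadInt64(0) + 1;   // read the current shared value
                    accessor.Write(0, value);                 // write it back for every other process to see
                    return value;
                }
            }
            finally
            {
                Lock.ReleaseMutex();
            }
        }
    }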

Most efficient way to search for files

I am writing a program that searches for mp3 files and copies them to a specified directory.
Currently I am using a List that is filled with all the mp3s in a directory (which takes - not surprisingly - a very long time). Then I use taglib-sharp to compare the ID3 tags with the artist and title entered. If they match, I copy the file.
Since this is my first program and I am very new to programming I figure there must be a better/more efficient way to do this. Does anybody have a suggestion on what I could try?
Edit: I forgot to add an important detail: I want to be able to specify which directories should be searched every time I start a search (the directory to be searched will be specified in the program itself). So storing all the files in a database or something similar isn't really an option (unless there is a way to do this every time that is still efficient). I am basically looking for the best way to search through all the files in a directory when the files are indexed every time. (I am aware that this is probably not a good idea, but I'd like to do it that way. If there is no real way to do this I'll have to reconsider, but for now I'd like to do it like that.)
You are mostly saddled with the bottleneck that is IO, a consequence of the hardware you are working with. The copying of the files will be the dominant cost here (finding the files is dwarfed in comparison to copying them).
There are other ways to go about file management, each exposing better interfaces for different purposes, such as NTFS change journals and low-level sector handling (not recommended), but if this is your first program in C# then maybe you don't want to venture into P/Invoking native calls.
Other than alternative mechanisms, you might consider ways to minimise disk access - i.e. not redoing anything you have already done, or don't need to do.
Use a database (a simple binary serialized file or an embedded database like RavenDB) to cache all the files, and query that cache instead.
Also store the modified time for each folder in the database. Compare the time in the database with the time on the folder each time you start your application (and sync changed folders).
That ought to give you much better performance. Threading will not really help searching folders since it's the disk IO that takes time, not your application.
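
A sketch of that folder-timestamp idea, using an in-memory dictionary as the cache for brevity (persisting it to a file or an embedded database works the same way); note that a folder's last write time only changes when files are added to or removed from it directly:

    using System;
    using System.Collections.Generic;
    using System.IO;

    class Mp3Index
    {
        private readonly Dictionary<string, DateTime> _folderStamps =
            new Dictionary<string, DateTime>();
        private readonly Dictionary<string, List<string>> _filesByFolder =
            new Dictionary<string, List<string>>();

        public IEnumerable<string> GetMp3Files(string rootFolder)
        {
            foreach (var folder in Directory.EnumerateDirectories(rootFolder))
            {
                DateTime stamp = Directory.GetLastWriteTimeUtc(folder);
                DateTime cached;
                if (!_folderStamps.TryGetValue(folder, out cached) || cached != stamp)
                {
                    // Folder is new or changed: re-scan it and refresh the cache.
                    _filesByFolder[folder] =
                        new List<string>(Directory.EnumerateFiles(folder, "*.mp3"));
                    _folderStamps[folder] = stamp;
                }

                foreach (var file in _filesByFolder[folder])
                    yield return file;
            }
        }
    }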

fs.write(); fs.flush(); When is it really written to disk? What if there's a kernel panic or power outage?

I need to implement some atomic writes to secondary storage. How can I make this foolproof?
If I open a file in C# using File.Open, I will receive a handle. I can write some data to it, flush it and close it. But I still have some questions. I guess the statements below are true?
Data might not be written to disk but rather exist in the Windows Disk cache
Data might not be written to disk but rather exist in the HDD cache
And this will lead to the following issues:
A power outage will revert the edits I made to the file (on a transactional FS like NTFS)
A kernel panic will revert the edits I made to the file (on a transactional FS like NTFS)
Am I correct in my assumptions? If so, how can I make a foolproof write to the disk?
I have looked a little bit into NoSQL and have been thinking there might be a NoSQL server that could talk to the system closer to the hardware and not return to my software until it can guarantee that the bytes are written to disk.
All ideas and thoughts are welcome
Jens
[edit]
Maybe there is a specific amount of time I can wait before being sure that all changes are written to physical disk?
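
For what it's worth, there are knobs in .NET for pushing writes past the OS cache - a sketch, with the caveat that the drive's own cache may still lie about having written the data: FileOptions.WriteThrough asks Windows to skip its cache, and FileStream.Flush(true) (.NET 4 and later) asks the OS to flush its buffers to the device.

    using System.IO;

    static class DurableWrite
    {
        public static void Write(string path, byte[] data)
        {
            // WriteThrough: ask Windows not to lazily cache the write.
            using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write,
                                           FileShare.None, 4096, FileOptions.WriteThrough))
            {
                fs.Write(data, 0, data.Length);
                fs.Flush(true);   // also flush intermediate OS buffers to the device (.NET 4+)
            }
        }
    }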
The only way to make an operation fully "foolproof" is to queue it, run the operation, and confirm. Things stay in the queue and can be run again until confirmed, or "rolled back" if the confirmation is negative.
The window of time you are talking about, assuming you are not involving a network (everything is local), is very small. Still, if you want to ensure things, you queue them. MSMQ is one option. If the data comes from SQL Server, you can consider its queueing mechanism, Service Broker (not recommending this direction, but it is one way).
Ultimately, the idea here is a lot like a handshake, as used in most server-to-server communication. Everyone agrees things are done before both sides get rid of their piece of the work.
I am not an expert in Windows internals, but I believe you are correct. I didn't test it in great detail, but I was able to use MSMQ as a pretty reliable place to store data, with another process that monitored the queue for final processing.
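
A minimal MSMQ sketch of that queue-then-confirm idea (the queue path is hypothetical; recoverable messages are persisted by MSMQ, so they survive a crash or power loss until a consumer processes and acknowledges them):

    using System.Messaging;

    static class DurableQueue
    {
        const string QueuePath = @".\private$\pending-writes";   // hypothetical local private queue

        public static void Enqueue(string payload)
        {
            if (!MessageQueue.Exists(QueuePath))
                MessageQueue.Create(QueuePath, true);   // true = transactional queue

            using (var queue = new MessageQueue(QueuePath))
            using (var tx = new MessageQueueTransaction())
            {
                tx.Begin();
                // Recoverable messages are written to disk by MSMQ rather than held only in RAM.
                queue.Send(new Message(payload) { Recoverable = true }, tx);
                tx.Commit();
            }
        }
    }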
