Efficiently streaming data across process boundaries in .NET

I've been working on an internal developer tool on and off for a few weeks now, but I'm running into an ugly stumbling block I haven't managed to find a good solution for. I'm hoping someone can offer some ideas or guidance on the best ways to use the existing frameworks in .NET.
Background: the purpose of this tool is to load multiple different types of log files (Windows Event Log, IIS, SQL trace, etc.) to the same database table so they can be sorted and examined together. My personal goal is to make the entire thing streamlined so that we only make a single pass and do not cache the entire log either in memory or to disk. This is important when log files reach hundreds of MB or into the GB range. Fast performance is good, but slow and unobtrusive (allowing you to work on something else in the meantime) is better than running faster but monopolizing the system in the process, so I've focused on minimizing RAM and disk usage.
I've iterated through a few different designs so far trying to boil it down to something simple. I want the core of the log parser--the part that has to interact with any outside library or file to actually read the data--to be as simple as possible and conform to a standard interface, so that adding support for a new format is as easy as possible. Currently, the parse method returns an IEnumerable<Item> where Item is a custom struct, and I use yield return to minimize the amount of buffering.
However, we quickly run into some ugly constraints: the libraries provided (generally by Microsoft) to process these file formats. The biggest and ugliest problem: one of these libraries only works in 64-bit. Another one (Microsoft.SqlServer.Management.Trace TraceFile for SSMS logs) only works in 32-bit. As we all know, you can't mix 32- and 64-bit code in the same process. Since the entire point of this exercise is to have one utility that can handle any format, we need a separate child process (which in this case handles the 32-bit-only portion).
The end result is that I need the 64-bit main process to start up a 32-bit child, provide it with the information needed to parse the log file, and stream the data back in some way that doesn't require buffering the entire contents to memory or disk. At first I tried using stdout, but that fell apart with any significant amount of data. I've tried using WCF, but it's really not designed to handle the "service" being a child of the "client", and it's difficult to get them synchronized backwards from how they want to work, plus I don't know if I can actually make them stream data correctly. I don't want to use a mechanism that opens up unsecured network ports or that could accidentally crosstalk if someone runs more than one instance (I want that scenario to work normally--each 64-bit main process would spawn and run its own child). Ideally, I want the core of the parser running in the 32-bit child to look the same as the core of a parser running in the 64-bit parent, but I don't know if it's even possible to continue using yield return, even with some wrapper in place to help manage the IPC. Is there any existing framework in .NET that makes this relatively easy?

WCF does have a P2P mode. However, if all your processes are on the local machine, you are better off with IPC such as named pipes, which run in kernel mode and do not have the messaging overhead of the former.
Failing that, you could try COM, which should not have a problem talking between 32-bit and 64-bit processes.

In case anyone stumbles across this, I'll post the solution that we eventually settled on. The key was to redefine the inter-process WCF service interface to be different from the intra-process IEnumerable interface. Instead of attempting to yield return across process boundaries, we stuck a proxy layer in between that uses an enumerator, so we can call a "give me an item" method over and over again. It's likely this has more performance overhead than a true streaming solution, since there's a method call for every item, but it does seem to get the job done, and it doesn't leak or consume memory.
We did follow Micky's suggestion of using named pipes, but still within WCF. We're also using named semaphores to coordinate the two processes, so we don't attempt to make service calls until the "child service" has finished starting up.
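
For anyone wanting to picture the proxy layer described above, here is a minimal sketch. It is not our actual code: LogItem, IItemSource, EnumeratorItemSource and ReadAll are hypothetical names, and the WCF and named pipe plumbing is left out, so the "remote" call is just an in-process method call here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical item type standing in for the custom struct from the question.
public struct LogItem
{
    public DateTime Timestamp;
    public string Message;
}

// Hypothetical pull-style contract: one "give me an item" call per item.
// In the real tool this role is played by the inter-process service interface.
public interface IItemSource
{
    // Returns false once the underlying parser has no more items.
    bool TryGetNext(out LogItem item);
}

// Child side: wraps the parser's lazy IEnumerable in an enumerator-backed source.
public sealed class EnumeratorItemSource : IItemSource, IDisposable
{
    private readonly IEnumerator<LogItem> _enumerator;

    public EnumeratorItemSource(IEnumerable<LogItem> items)
    {
        _enumerator = items.GetEnumerator();
    }

    public bool TryGetNext(out LogItem item)
    {
        if (_enumerator.MoveNext())
        {
            item = _enumerator.Current;
            return true;
        }
        item = default(LogItem);
        return false;
    }

    public void Dispose()
    {
        _enumerator.Dispose();
    }
}

// Parent side: turns the repeated calls back into a lazy sequence,
// so the consuming code still sees an IEnumerable<LogItem>.
public static class ItemSourceExtensions
{
    public static IEnumerable<LogItem> ReadAll(this IItemSource source)
    {
        LogItem item;
        while (source.TryGetNext(out item))
        {
            yield return item;
        }
    }
}
```

Only one item is in flight at a time, which is why memory stays flat. Batching several items per call would probably cut the per-call overhead, but we did not try that.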

Related

C# Restricting DLL's to only one instance

I essentially want to make an api for an application but I only want one instance of that dll to be running at one time.
So multiple applications also need to be able to use the DLL at the same time. As you would expect from a normal api.
However I want it to be the same instance of the dll that the different applications use. This is because of communication with hardware that I don't want to be able to overlap.

DLLs are usually loaded once per process, so if your application is guaranteed to only be running in single-instance mode, there's nothing else you have to do. Your single application instance will have only one loaded DLL.
Now, if you want to "share" a "single instance" of a DLL across applications, you will inevitably have to resort to a client-server architecture. Your DLL will have to be wrapped in a Windows Service, which would expose an HTTP (or WCF) API.

You can't do that as you intend to do. The best way to do this would be having a single process (a DLL is not a process) which receives and processes messages, and have your multiple clients use an API (this would be your DLL) that just sends messages to this process.
The intercommunication of those two processes (your single process and the clients sending or receiving the messages via your API) could be done in many ways, choose the one that suits you better (basically, any kind of client/server architecture, even if the clients and the server are running on the same hardware)

This is an XY-Problem type of question. Your actual requirement is serializing interactions with the underlying hardware, so they do not overlap. Perhaps this is what you should explicitly and specifically be asking about.
Your proposed solution is to have a DLL that is kind of an OS-wide singleton or something like that. This is actually what you are asking about; although it is still not the right approach, in my opinion. The OS is in charge of managing the lifetime of the DLL modules in each process. There are many aspects to this, but for one: most DLL instances are already being shared between every process (mostly code sections, resources and such - data, of course, is not shared by default).
To solve your actual problem, you would have to resort to multi-process synchronization techniques. In Windows, this works mostly through named kernel objects like mutexes, semaphores, events and such. Another approach would be to use IPC, as other folks have already mentioned in their respective answers, which then again would require in itself some kind of synchronization.
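
As a rough illustration of the named kernel object approach, the sketch below wraps each hardware interaction in a named Mutex. The HardwareGate class and the mutex name are made up for the example, and the timeout is an assumption added so that a stuck owner shows up as an error rather than a hang.

```csharp
using System;
using System.Threading;

public static class HardwareGate
{
    // Hypothetical name. Every process that talks to the device must use the
    // same string. The "Global\" prefix makes it visible across sessions.
    private const string MutexName = "Global\\Example.Device.Access";

    // Runs the action while holding the cross-process mutex.
    public static T RunExclusive<T>(Func<T> action, TimeSpan timeout)
    {
        using (var mutex = new Mutex(false, MutexName))
        {
            bool acquired;
            try
            {
                acquired = mutex.WaitOne(timeout);
            }
            catch (AbandonedMutexException)
            {
                // The previous owner exited without releasing.
                // The wait still succeeded, so this thread owns the mutex now.
                acquired = true;
            }

            if (!acquired)
            {
                throw new TimeoutException("The device is busy in another process.");
            }

            try
            {
                return action();
            }
            finally
            {
                // Must run on the same thread that acquired the mutex.
                mutex.ReleaseMutex();
            }
        }
    }
}
```

A caller would write something like HardwareGate.RunExclusive(() => ReadSensor(), TimeSpan.FromSeconds(5)), where ReadSensor is whatever the DLL does with the device. This only serializes callers that go through the wrapper. It does nothing about code that talks to the hardware directly.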
Maybe all this is already handled by that hardware's device driver. What would be the real scenarios in which overlapped interactions with the underlying hardware would have a negative impact on the applications that use your DLL?

To ensure you have loaded only one DLL per machine, you would need to run a controlling assembly in a separate AppDomain, then try creating a named pipe for remoting (with IpcChannel) and claim the hardware resources. Creating the IpcChannel will fail the second time in the same environment. If you need high-performance communication with your hardware, use remoting only for claiming and releasing the resource by another assembly used by the applications.

A Mutex is one solution for exclusive control across multiple processes.
But be careful if you use one: a Mutex can sometimes cause a deadlock.

Pass large amounts of data between app domains quickly

I have an application used to import a large dataset (millions of records) from one database to another, doing a diff in the process (i.e., removing things that were deleted, updating things, etc.). Due to many foreign key constraints and such and to try to speed up the processing of the application, it loads up the entire destination database into memory and then tries to load up parts of the source database and does an in-memory compare, updating the destination in memory as it goes. In the end it writes these changes back to the destination. The databases do not match one to one, so a single table in one may be multiple tables in the other, etc.
So to my question: it currently takes hours to run this process (sometimes close to a day depending on the amount of data added/changed) and this makes it very difficult to debug. Historically, when we encounter a bug, we have made a change, and then rerun the app which has to load all of the data into memory again (taking quite some time) and then run the import process until we get to the part we were at and then we cross our fingers and hope our change worked. This isn't fun :(
To speed up the debugging process I am making an architectural change by moving the import code into a separate dll that is loaded into a separate appdomain so that we can unload it, make changes, and reload it and try to run a section of the import again, picking up where we left off, and seeing if we get better results. I thought that I was a genius when I came up with this plan :) But it has a problem. I either have to load up all the data from the destination database into the second appdomain and then, before unloading, copy it all to the first using the [Serializable] deal (this is really really slow when unloading and reloading the dll) or load the data in the host appdomain and reference it in the second using MarshalByRefObject (which has turned out to make the whole process slow it seems)
So my question is: How can I do this quickly? Like, a minute max! I would love to just copy the data as if it was just passed by reference and not have to actually do a full copy.
I was wondering if there was a better way to implement this so that the data could better be shared between the two or at least quickly passed between them. I have searched and found things recommending the use of a database (we are loading the data in memory to AVOID the database) or things just saying to use MarshalByRefObject. I'd love to do something that easy but it hasn't really worked yet.
I read somewhere that loading a C++ dll or unmanaged dll will cause it to ignore app domains and could introduce some problems. Is there any way I could use this to my advantage, i.e., load an unmanaged dll that holds my list for me or something, and use it to trick my application into using the same memory area for both appdomains so that the lists just stick around when I unload the other dll by unloading the app domain?
I hope this makes sense. It's my first question on here so if I've done a terrible job do help me out. This has frustrated me for a few days now.

The app domain approach is a good way to separate things so that you can load and unload only part of your application. Unfortunately, as you discovered, exchanging data between two app domains is neither easy nor fast. It is just like two different system processes trying to communicate, which will always be slower than communication within the same process. So the way to go is to use the quickest possible inter-process communication mechanism. Skip WCF, as it adds overhead you do not need here. Use named pipes, through which you can stream data very fast. I have used them before with good results. To go even faster you can try MemoryMappedFile, but that's more difficult to implement. Start with named pipes, and if that is too slow, go for memory-mapped files.
Even with a fast way of sending, you may hit another bottleneck: data serialization. For large amounts of data, standard serialization (even binary) is very slow. You may want to look at Google's protocol buffers.
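
To give an idea of what streaming over a named pipe can look like, here is a minimal sketch using NamedPipeServerStream and NamedPipeClientStream. The PipeStreamingSketch class, the one-boolean-per-record framing and the final acknowledgement byte are all assumptions made for the example. BinaryWriter stands in for whatever serializer you pick (it is not protocol buffers), and records are plain strings to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Pipes;
using System.Text;

public static class PipeStreamingSketch
{
    // Producer side: writes each record as soon as it is produced.
    public static void Send(string pipeName, IEnumerable<string> records)
    {
        using (var server = new NamedPipeServerStream(pipeName, PipeDirection.InOut))
        {
            server.WaitForConnection();
            using (var writer = new BinaryWriter(server, Encoding.UTF8, true))
            {
                foreach (string record in records)
                {
                    writer.Write(true);    // flag: another record follows
                    writer.Write(record);  // length-prefixed string
                }
                writer.Write(false);       // flag: end of stream
                writer.Flush();
            }
            // Wait for the consumer's acknowledgement (or for it to hang up)
            // so the pipe is not closed while data is still unread.
            server.ReadByte();
        }
    }

    // Consumer side: yields records lazily as they arrive.
    public static IEnumerable<string> Receive(string pipeName)
    {
        using (var client = new NamedPipeClientStream(".", pipeName, PipeDirection.InOut))
        {
            client.Connect(5000); // wait up to 5 seconds for the producer
            using (var reader = new BinaryReader(client, Encoding.UTF8, true))
            {
                while (reader.ReadBoolean())
                {
                    yield return reader.ReadString();
                }
            }
            client.WriteByte(1); // acknowledge that everything was read
        }
    }
}
```

Neither side ever holds more than one record, so memory use does not grow with the size of the data set. If the per-record writes turn out to be the bottleneck, wrapping the pipe in a BufferedStream is one thing to try before moving on to memory-mapped files.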
One word of caution on AppDomain - any uncaught exception in one of the app domains brings the whole process down. They are not that separated, unfortunately.
On a side note: I do not know what your application does, but millions of records does not seem that excessive. Maybe there is room for optimization?

You didn't say if it were SQL Server, but did you look at using SSIS for doing this? There are evidently some techniques that can make it fast with big data.

C# communication between processes

I'm working with an application, and I am able to make C# scripts to run in this environment. I can import DLLs of any kind into this environment. My problem is that I'd like to enable communication between these scripts. As the environment is controlled and I have no access to the source code of the application, I'm at a loss as to how to do this.
Things I've tried:
File I/O: Just writing the messages that I would like each to read in .txt files and having the other read it. The problem is that I need these scripts to run quite quickly, and that took up too much time.
nServiceBus: I tried this, but I just couldn't get it to work in the environment that I'm dealing with. I'm not saying it can't be done, just that I can't get it done.
Does anyone know of a simple way to do this, that is also pretty fast?

Your method of interprocess communication should depend on how important it is that each message get processed.
For instance, if process A tells process B to, say, send an email to your IT staff saying that a server is down, it's pretty important.
If however you're streaming audio, individual messages (packets) aren't critical to the performance of the app, and can be dropped.
If the former, you should consider using persistent storage such as a database to store messages, and let each process poll the database to retrieve its own messages. In this way, if a process is terminated or loses communication with the other processes temporarily, it will be able to retrieve whatever messages it has missed when it starts up again.

The answer is simple;
Since you can import any DLL into the script you may create a custom DLL that will implement communication between the processes in any way you desire: shared memory, named pipe, TCP/UDP.

You could use a form of Interprocess Communication, even within the same process. Treat your scripts as separate processes, and communicate that way.
Named pipes could be a good option in this situation. They are very fast, and fairly easy to use in .NET 3.5.
Alternatively, if the scripts are loaded into a single AppDomain, you could use a static class or singleton as a communication service. However, if the scripts get loaded in isolation, this may not be possible.
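
If the scripts do share an AppDomain, the static class idea could look roughly like the sketch below. ScriptMailbox and its channel names are hypothetical. The class would live in a DLL that every script imports.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical shared mailbox. It only works when all scripts are loaded
// into the same AppDomain, because static state is per AppDomain.
public static class ScriptMailbox
{
    private static readonly ConcurrentDictionary<string, ConcurrentQueue<string>> Queues =
        new ConcurrentDictionary<string, ConcurrentQueue<string>>();

    // Adds a message to the named channel, creating the channel on first use.
    public static void Post(string channel, string message)
    {
        Queues.GetOrAdd(channel, _ => new ConcurrentQueue<string>()).Enqueue(message);
    }

    // Takes the oldest message from the channel, if there is one.
    public static bool TryReceive(string channel, out string message)
    {
        ConcurrentQueue<string> queue;
        if (Queues.TryGetValue(channel, out queue))
        {
            return queue.TryDequeue(out message);
        }
        message = null;
        return false;
    }
}
```

One script calls ScriptMailbox.Post("prices", "42") and another polls ScriptMailbox.TryReceive("prices", out msg). A quick way to check whether the scripts are isolated is to post from one and see whether the other ever receives anything.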

Well, not knowing the details of your environment, there is not much I can really offer. You are using the term "C# scripts"...I am not exactly sure what that means, as C# is generally a compiled language.
If you are using normal C#, have you looked into WCF with Named Pipes? If your assemblies are running on the same physical machine, you should be able to easily and quickly create some WCF services hosted with the Named Pipe binding. Named pipes provide a simple, efficient, and quick message transfer mechanism in a local context. WCF itself is pretty easy to use, and is a native component of the .NET framework.

Since you already have the File I/O in place you might get enough speed by placing it on a RAM disk. If you are polling for changes today a FileSystemWatcher could help to get your communication more responsive.

You can use PipeStream, which is faster than disk I/O because it works in main memory.

XMPP/Jabber is another approach; take a look at jabber.net.

Another easy way is to open a TCP Socket on a predefined Port, connect to it from the other process and communicate that way.
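
A minimal sketch of the socket approach, assuming a simple one-line request and one-line reply exchange. LoopbackLine is a made-up helper and the "ack:" reply format is only for illustration. Binding to IPAddress.Loopback keeps the port unreachable from other machines.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class LoopbackLine
{
    // Listening side: accepts one connection, reads one line, sends one reply.
    public static async Task ServeOnceAsync(TcpListener listener)
    {
        using (TcpClient client = await listener.AcceptTcpClientAsync())
        using (NetworkStream stream = client.GetStream())
        using (var reader = new StreamReader(stream))
        using (var writer = new StreamWriter(stream) { AutoFlush = true })
        {
            string line = await reader.ReadLineAsync();
            await writer.WriteLineAsync("ack:" + line);
        }
    }

    // Connecting side: sends one line and returns the reply.
    public static string SendLine(int port, string message)
    {
        using (var client = new TcpClient())
        {
            client.Connect(IPAddress.Loopback, port);
            using (NetworkStream stream = client.GetStream())
            using (var reader = new StreamReader(stream))
            using (var writer = new StreamWriter(stream) { AutoFlush = true })
            {
                writer.WriteLine(message);
                return reader.ReadLine();
            }
        }
    }
}
```

The listening script would create new TcpListener(IPAddress.Loopback, port), call Start(), and then await ServeOnceAsync in a loop. The port number is whatever the two scripts agree on in advance.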

Please help me with a program for virus detection using detection of malicious behavior

I know how antivirus software detects viruses. I read a few articles:
How do antivirus programs detect viruses?
http://www.antivirusworld.com/articles/antivirus.php
http://www.agusblog.com/wordpress/what-is-a-virus-signature-are-they-still-used-3.htm
http://hooked-on-mnemonics.blogspot.com/2011/01/intro-to-creating-anti-virus-signatures.html
During the one-month vacation I'm having, I want to learn about and code a simple virus detection program:
So, there are 2-3 ways (from above articles):
Virus Dictionary : Searching for virus signatures
Detecting malicious behavior
I want to take the 2nd approach. I want to start off with simple things.
As a side note, recently I encountered a software named "ThreatFire" for this purpose. It does a pretty good job.
The first thing I don't understand is how this program can intervene in the execution of another program and prompt the user about its actions. Isn't that some kind of violation?
How does it scan the memory of other programs? A program is confined to its own virtual address space, right?
Is C# .NET correct for doing this kind of stuff?
Please post your ideas on how to go about it. Also mention some simple things that I could do.

This happens because the software in question likely has a special driver installed to allow it low level kernel access which allows it to intercept and deny various potentially malicious behavior.
Having the rights that many drivers do grants it the ability to scan another process's memory space.
No. C# needs a good chunk of the operating system already loaded. Drivers need to load first.
Learn about driver and kernel-level programming. I've not done so myself, so I can't be of more help here.

I think system calls are the way to go, and a lot more doable than actually trying to scan multiple processes' memory spaces. While I'm not a low-level Windows guy, it seems like this can be accomplished using Windows API hooks: tie-ins to the low-level API that can modify the system-wide response to a system call. These hooks can be installed as something like a kernel module, and intercept and potentially modify system calls. I found an article on CodeProject that offers more information.
In a machine learning course I took, a group decided to try something similar to what you're describing for a semester project. They used a list of recent system calls made by a program to determine whether or not the executing program was malicious, and the results were promising (think 95% recognition on new samples). In their project, they trained using SVMs on windowed call lists, and used that to determine a good window size. After that, you can collect system call lists from different malicious programs, and either train on the entire list, or find what you consider "malicious activity" and flag it. The cool thing about this approach (aside from the fact that it's based on ML) is that the window size is small, and that many trained eager classifiers (SVM, neural nets) execute quickly.
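
To make "windowed call lists" a bit more concrete, here is a small sketch of the windowing step only. I don't have the group's code, so CallWindows.Slide is a made-up helper and the call names in the usage note are placeholders. Each window would then be turned into a feature vector for whatever classifier you use.

```csharp
using System;
using System.Collections.Generic;

public static class CallWindows
{
    // Yields every run of windowSize consecutive calls, advancing one call at a time.
    public static IEnumerable<string[]> Slide(IList<string> calls, int windowSize)
    {
        if (windowSize <= 0)
        {
            throw new ArgumentOutOfRangeException("windowSize");
        }

        for (int start = 0; start + windowSize <= calls.Count; start++)
        {
            var window = new string[windowSize];
            for (int i = 0; i < windowSize; i++)
            {
                window[i] = calls[start + i];
            }
            yield return window;
        }
    }
}
```

For a recorded list such as open, read, write, close and a window size of 3, this yields two windows: (open, read, write) and (read, write, close).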
Anyway, it seems like it could be done without the ML if it's not your style. Let me know if you'd like more info about the group- I might be able to dig it up. Good luck!

Windows provides APIs to do that (generally they involve running at least some of your code in the kernel). If you have sufficient privileges, you can also inject a .dll into another process. See http://en.wikipedia.org/wiki/DLL_injection.
When you have the powers described above, you can do that. You are either in kernel space and have access to everything, or inside the target process.
At least for the low-level in-kernel stuff you'd need something more low-level than C#, like C or C++. I'm not sure, but you might be able to do some of the rest in a C# app.
The DLL injection sounds like the simplest starting point. You're still in user space, and don't have to learn how to live in the kernel world (it's a completely different world, really).
Some loose ideas on the topic in general:
you can interpose system calls issued by the traced process. It is generally assumed that a process cannot do anything "dangerous" without issuing a system call.
you can intercept its network traffic and see where it connects to, what does it send, what does it receive, which files does it touch, which system calls fail
you can scan its memory and simulate its execution in a sandbox (really hard)
with the system call interposition, you can simulate some responses to the system calls, but really just sandbox the process
you can scan the process memory and extract some general characteristics from it (connects to the network, modifies registry, hooks into Windows, enumerates processes, and so on) and see if it looks malicious
just put the entire thing in a sandbox and see what happens (a nice sandbox has been made for Google Chrome, and it's open source!)

.NET IPC without having a service mediator

I have two unrelated processes that use .NET assemblies as plugins. However, either process can be started/stopped at any time. I can't rely on a particular process being the server. In fact, there may be multiple copies running of one of the processes, but only one of the other.
I initially implemented a solution based off of this article. However, this requires the one implementing the server to be running before the client.
What's the best way to implement some kind of notification to the server when the client(s) were running first?

Using shared memory is tougher because you'll have to manage the size of the shared memory buffer (or just pre-allocate enough). You'll also have to manually manage the data structures that you put in there. Once you have it tested and working though, it will be easier to use and test because of its simplicity.
If you go the remoting route, you can use the IpcChannel instead of the TCP or HTTP channels for a single system communication using Named Pipes. http://msdn.microsoft.com/en-us/library/4b3scst2.aspx. The problem with this solution is that you'll need to come up with a registry type solution (either in shared memory or some other persistent store) that processes can register their endpoints with. That way, when you're looking for them, you can find a way to query for all the endpoints that are running on the system and you can find what you're looking for. The benefits of going with Remoting are that the serialization and method calling are all pretty straightforward. Also, if you decide to move to multiple machines on a network, you could just flip the switch to use the networking channels instead. The cons are that Remoting can get frustrating unless you clearly separate what are "Remote" calls from what are "Local" calls.
I don't know much about WCF, but that also might be worth looking into. Spider sense says that it probably has a more elegant solution to this problem... maybe.
Alternatively, you can create a "server" process that is separate from all the other processes and that gets launched (use a system Mutex to make sure more than one isn't launched) to act as a go-between and registration hub for all the other processes.
One more thing to look into is the Publish-Subscribe model for events (Pub/Sub). This technique helps when you have a listener that is launched before the event source is available, but you don't want to wait to register for the event. The "server" process will handle the event registry to link up the publishers and subscribers.

Why not host the server and the client on both sides, and whoever comes up first gets to be the server? And if the server drops out, the client that is still active switches roles.
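
One way to decide who "comes up first" is a named Mutex. The sketch below is an assumption about how that could be wired up, not a tested design: RoleElection is a made-up class, and actually starting or stopping the server endpoint is left to the caller.

```csharp
using System;
using System.Threading;

// Whoever creates the named mutex first takes the server role.
// A Mutex has thread affinity, so create, promote and dispose
// the election object on the same thread.
public sealed class RoleElection : IDisposable
{
    private readonly Mutex _mutex;

    public bool IsServer { get; private set; }

    public RoleElection(string name)
    {
        bool createdNew;
        // Request initial ownership. It is only granted when this call
        // actually created the mutex.
        _mutex = new Mutex(true, name, out createdNew);
        IsServer = createdNew;
    }

    // A client calls this periodically. It succeeds once the current
    // server has released the mutex or has exited.
    public bool TryPromote()
    {
        if (IsServer)
        {
            return true;
        }
        try
        {
            IsServer = _mutex.WaitOne(0);
        }
        catch (AbandonedMutexException)
        {
            // The old server died while holding the mutex. We own it now.
            IsServer = true;
        }
        return IsServer;
    }

    public void Dispose()
    {
        if (IsServer)
        {
            _mutex.ReleaseMutex();
        }
        _mutex.Dispose();
    }
}
```

Each process creates a RoleElection with the same agreed name at startup. If IsServer is true it hosts the endpoint. Otherwise it connects as a client and keeps calling TryPromote so it can switch roles when the server goes away.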

There are many ways to handle IPC (.NET or not), and a TCP/HTTP tunnel is one way, but it can be a very bad choice (depending on circumstances and environment).
Shared memory and named pipes are two ways (and yes they can be done in .Net) that might be better solutions for you. There is also the IPC class in the .Net Framework...but I personally don't like them due to some AppDomain issues...

I agree with Garo.
Using a pub/sub service would be a great solution. This obviously means that this service would need to be up and running before either of the other two.
If you want to skip the pub/sub you can just implement the service in both applications with different end points. When either of the applications is launched it tries to access the other known object via the IPC proxy. If the proxy fails, the other object isn't up.
-Scott

I've spent 2 days meandering through all the options available for IPC while looking for a reliable, simple, and fast way to do full-duplex IPC. IPCLibrary, which I found on Codeplex.com, is so far working perfectly out of all the options that I tried. All with only 7 lines of code. :D If anyone stumbles across this trying to find a full-duplex IPC, save yourself a ton of time and give this library a try. Grab the source code, compile the data.dll and follow the examples given.
HTH,
Circ
