I'm going to create some command-line tools that make use of some large library DLLs. For security reasons I plan to embed the DLLs in the command-line tool's EXE.
Example:
Suppose the CL's (command-line tool's) functionality is just to copy a file from A to B. The procedure to do this is included in a 100 MB library DLL. If I just took the relevant lines of code out of the DLL and pasted them into the CL's code, the CL would only be 10 KB.
But I don't want to do that, so I embed the full library in the CL's EXE, which will make it 101 MB in size.
Please be aware that the above is just an example. I once read somewhere (I cannot remember where) that Windows would only use the part of the EXE that's actually used. If that's true, it shouldn't matter whether the EXE is 10 KB, 100 MB or 1 GB in size. I don't know whether it is true, which is why I'm asking this question.
I own the code of the DLL, so if the best solution is not to include the whole DLL but to link in or include only those code files from the DLL project that the CL actually uses, then I will go that way.
So the question is: will the 10 KB CL run faster and consume less memory than the 101 MB CL?
First of all, if you're embedding the extra dll into the executable for security reasons, then don't.
If the program can unpack it, anyone else can, so you're only fooling yourself if you think this will improve security, unless it is job security you're talking about.
Secondly, I suspect the underlying question here is quite a bit harder to answer than others might think.
If you had used a regular non-managed executable and a non-managed DLL, then portions of those files would be reserved in memory space when you start the program, but only the actual bits of them you use would be mapped into physical memory.
In other words, the actual amount of physical memory the program would consume would be somewhat proportional to the amount of code you use from them and how that code is spread out. I say "somewhat" because paging into physical memory is done on a per-page basis, and pages have a fixed size. So a 100-byte function might end up mapping a 4 KB or 8 KB page (I don't recall the exact page size any more) into memory. If you call a lot of such functions, and they're spread out over the address space of the code, you might end up mapping in a lot of physical memory for little code.
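(If you want to check the actual page size on your system instead of guessing, a minimal sketch:)

```csharp
using System;

class PageSizeCheck
{
    static void Main()
    {
        // On typical x86/x64 Windows systems this prints 4096 (4 KB).
        Console.WriteLine($"System page size: {Environment.SystemPageSize} bytes");
    }
}
```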
When it comes to managed assemblies, the picture changes somewhat. Managed code isn't mapped directly into physical memory the same way (note: I'm fuzzy on the exact details here) because the code is not ready to run. Before it can be run it has to be JITted, and the JITter only JITs code on a need-to-JIT basis.
In other words, if you include a humongous class in the assembly, but never use it, it might never end up being JITted and thus not use any memory.
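(To make that on-demand behaviour concrete, here is a rough sketch; RuntimeHelpers.PrepareMethod is just a way to force compilation of a method that would otherwise never be JITted, and the class and method names are made up.)

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

class JitDemo
{
    // Never called during normal execution, so the JIT should never
    // compile it and its native code costs nothing.
    static void HugeUnusedCodePath() => Console.WriteLine("never runs");

    static void Main()
    {
        // Force JIT compilation explicitly, just to show that it is
        // otherwise deferred until the first call.
        MethodInfo m = typeof(JitDemo).GetMethod(
            nameof(HugeUnusedCodePath),
            BindingFlags.NonPublic | BindingFlags.Static);
        RuntimeHelpers.PrepareMethod(m.MethodHandle);

        Console.WriteLine("HugeUnusedCodePath was JITted only because we asked.");
    }
}
```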
So is the answer "no", as in it won't use more memory?
I have no idea. I suspect that there will be some effect of a larger assembly, more metadata to put into reflection tables or whatnot.
However, if you intend to place it into the executable, you either need to unpack it to disk before loading it (which would "circumvent" your "security features"), or unpack it into memory (which would require all of those 100 MB of physical memory).
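(For completeness, the usual in-memory unpacking trick looks roughly like the sketch below; the resource name is a placeholder, and note that the whole 100 MB ends up as a byte array in memory before it is even loaded.)

```csharp
using System;
using System.IO;
using System.Reflection;

static class EmbeddedLoader
{
    // Call once at startup, before anything from the big library is used.
    public static void Install()
    {
        AppDomain.CurrentDomain.AssemblyResolve += (sender, args) =>
        {
            // "MyTool.BigLibrary.dll" is a placeholder resource name.
            using (Stream s = Assembly.GetExecutingAssembly()
                       .GetManifestResourceStream("MyTool.BigLibrary.dll"))
            {
                if (s == null) return null;
                using (var ms = new MemoryStream())
                {
                    s.CopyTo(ms);                 // the full DLL is now in memory
                    return Assembly.Load(ms.ToArray());
                }
            }
        };
    }
}
```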
So if you're concerned about using a lot of memory, here's my advice:
Try it, see how bad it is
Decide if it is worth it
And don't embed the extra assembly into the executable
Will the smaller one run faster and consume less memory? Yes.
Will it be enough to make a difference? Who knows? If done wrong, the big one might take up about 100 MB more memory (three guesses where I got that amount from).
But it sure seems awfully silly to include 100MB of 'stuff' that isn't needed...
EDIT: My "Yes" at the top here should be qualified with "infinitesimally so", and incidentally so. See comments, below.
I have a C# application that will continuously allocate memory for data stored in byte arrays. I have another process, written in Python, that will read from these arrays once instantiated. Both processes will be running on an Ubuntu machine.
The obvious solution seems to be to share memory between the processes by passing a pointer from the C# process to the python process. However, this has turned out to be difficult.
I've mainly looked at solutions proposed online. Two notable ones are named pipes and memory-mapped files. I read the following posts:
The first post, Share memory between C/C++ and Python, suggests sharing memory via named pipes.
The C# application will neither read from nor write to the array, and the Python script will only read from it. Therefore, this solution doesn't satisfy my efficiency requirements and seems superfluous when the data is literally already stored in memory.
When I looked at memory-mapped files, it seemed as though we would have to allocate memory for these mapped files and write the data into them. However, the data will already have been allocated before the mapped file is used, so this seems inefficient as well.
The second post:
https://learn.microsoft.com/en-us/dotnet/standard/io/memory-mapped-files?redirectedfrom=MSDN
The article says: "Starting with the .NET Framework 4, you can use managed code to access memory-mapped files in the same way that native Windows functions access memory-mapped files". Would an Ubuntu machine run into potential problems when reading these files in the same way that Windows would? And if not, could someone either give a simple example of using these mapped files between the programming languages mentioned above, including how to pass a reference to these mapped files between the processes, or give a reference to where someone has already done this?
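Here is a rough sketch of what I imagine the C# writer side could look like, assuming a file-backed map (as far as I can tell, named/pagefile-backed maps aren't supported on Linux); the path and sizes are placeholders:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class SharedWriter
{
    static void Main()
    {
        const long capacity = 1024 * 1024;            // placeholder size
        const string path = "/tmp/shared_block.bin";  // placeholder path

        // File-backed map: the Python process can mmap() the same file read-only.
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Create, null, capacity))
        using (var accessor = mmf.CreateViewAccessor(0, capacity))
        {
            byte[] data = { 1, 2, 3, 4 };             // the byte array to share
            accessor.WriteArray(0, data, 0, data.Length);
            accessor.Flush();
            Console.WriteLine("Data written; keeping the map alive...");
            Console.ReadLine();
        }
    }
}
```

On the Python side I would expect the standard mmap module to be able to open the same file read-only, but I haven't verified this.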
Or if someone knows how to directly pass a pointer to a byte array from C# to python, that would be even better if possible.
Any help is greatly appreciated!
So after coming back to this post four months later, I did not find a solution that satisfied my efficiency needs.
I had tried to find a way to get around having to write a large amount of data, already allocated in memory, to another process. That would have meant reallocating that same data, taking up double the amount of memory and adding additional overhead, even though the data would be read-safe. However, it seems the proper way to solve this for two processes running on the same machine is to use named pipes, as they are faster than e.g. sockets. As Holger stated, this ended up being a question of inter-process communication.
I ended up writing the whole application in Python, which happened to be the better alternative in the end anyway, probably saving me a lot of headache.
Performance-wise, is it wrong to embed a file in a resource section of a DLL?
This might seem silly, but I am trying to embed some info inside the DLL which can later be fetched by some methods, in case the whole solution and documentation are lost and we have only the DLL.
What are the downsides of doing such a thing?
Is it suggested or prohibited?
Embedded resources are done very efficiently. Under the hood, it uses the demand-paged virtual memory capabilities of the operating system, the exact equivalent of a memory-mapped file. In other words, the resource is directly accessible in memory, and you don't pay for the resource until you start using it. The first access to the resource forces it to be read from the file and copied into RAM. And it is very cheap to unmap again; the operating system can simply discard the page. There is no way to make it more efficient.
The other side of the medal is that it is permanently mapped into virtual memory. In other words, your process loses the memory space occupied by the resource. You'll run out of available address space more quickly, and an OutOfMemoryException is more likely.
This is not something you normally worry about until you gobble up, say, half a gigabyte in a 32-bit process. And don't fret about it at all in a 64-bit process.
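(If it's a managed DLL, fetching the embedded info back out could look roughly like this; the resource name is a placeholder.)

```csharp
using System.IO;
using System.Reflection;

static class DocInfo
{
    // Reads an embedded text resource from the current assembly.
    public static string ReadNotes()
    {
        // "MyLib.Docs.notes.txt" is a placeholder resource name.
        using (Stream s = Assembly.GetExecutingAssembly()
                   .GetManifestResourceStream("MyLib.Docs.notes.txt"))
        using (var reader = new StreamReader(s))
        {
            return reader.ReadToEnd();
        }
    }
}
```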
I am using CodeDom to dynamically compile an assembly in memory
(using CompilerParameters.GenerateInMemory=True)
and would like to know if there is any way (using additional VB.NET code in my assembly) to prevent someone from being able to save a copy of the assembly to their desktop while the assembly is still running in memory.
Or is it even possible for the assembly to detect when someone is using some hacker-type program to save a copy of it while it's running in memory?
Could the experts let me know whether this is possible and how to accomplish it?
A short answer would be "no". The critical problem with any security-through-obscurity measure is that at some point, the code has to run. In the case of a managed library, this applies to metadata as well (unless you write your own IL to native compiler), because it has to be compiled by the JIT compiler. You can't really stop the "hacker types", because even at the lowest point, they can analyze the native code, and observe the memory directly. True, there's more high-level hackers (and script-kiddies) than low-level hackers, but the point stands.
In case of dynamic assemblies, they're definitely in the memory as well, as you've pointed out yourself. In fact, I believe they have a distinct virtual memory space, so it isn't even that hard to find them in the memory :)
Are you trying to implement some copy protection scheme or something? That is pretty much impossible even with native code, managed code only makes it that much easier to remove the protection :)
As the title states, I have a problem with high page file activity.
I am developing a program that processes a lot of images, which it loads from the hard drive.
From every image it generates some data that I save in a list. For every 3600 images, I save the list to the hard drive; its size is about 5 to 10 MB. The program runs as fast as it can, so it maxes out one CPU thread.
The program works, it generates the data that it is supposed to, but when I analyze it in Visual Studio I get a warning saying: DA0014: Extremely high rates of paging active memory to disk.
The memory consumption of the program, according to Task Manager, is about 50 MB and seems to be stable. When I ran the program I had about 2 GB left out of 4 GB, so I guess I am not running out of RAM.
http://i.stack.imgur.com/TDAB0.png
The DA0014 rule description says "The number of Pages Output/sec is frequently much larger than the number of Page Writes/sec, for example. Because Pages Output/sec also includes changed data pages from the system file cache. However, it is not always easy to determine which process is directly responsible for the paging or why."
Does this mean that I get this warning simply because I read a lot of images from the hard drive, or is it something else? Not really sure what kind of bug I am looking for.
EDIT: Link to image inserted.
EDIT1: The images are about 300 KB each. I dispose of each one before loading the next.
UPDATE: It looks from experiments like the paging comes from just loading the large number of files. As I am no expert in C# or the underlying GDI+ API, I don't know which of the answers is most correct. I chose Andras Zoltan's answer as it was well explained and because it seems he did a lot of work to explain the reason to a newcomer like me :)
Updated following more info
The working set of your application might not be very big - but what about the virtual memory size? Paging can occur because of this and not just because of its physical size. See this screenshot from Process Explorer of VS2012 running on Windows 8:
And in Task Manager? Apparently the private working set for the same process is 305,376 KB.
We can take from this a) that Task Manager can't necessarily be trusted and b) an application's size in memory, as far as the OS is concerned, is far more complicated than we'd like to think.
You might want to take a look at this.
The paging is almost certainly because of what you do with the files, and the high final figures are almost certainly because of the number of files you're working with. A simple test of that would be to experiment with different numbers of files and generate a dataset of final paging figures alongside those. If the number of files is causing the paging, you'll see a clear correlation.
Then take out any processing (but keep the image-loading) you do and compare again - note the difference.
Then stub out the image-loading code completely - note the difference.
Clearly you'll see the biggest drop in faults when you take out the image loading.
Now, looking at the Emgu.CV Image code, it uses the Image class internally to get the image bits - so that's firing up GDI+ via the function GdipLoadImageFromFile (second entry on this index) to decode the image (using system resources, plus potentially large byte arrays) - and then it copies the data to an uncompressed byte array containing the actual RGB values.
This byte array is allocated using GCHandle.Alloc (also surrounded by GC.AddMemoryPressure and GC.RemoveMemoryPressure) to create a pinned byte array to hold the image data (uncompressed). Now I'm no expert on .NET memory management, but it seems to me that what we have here is a potential for heap fragmentation, even if each file is loaded sequentially and not in parallel.
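(To illustrate the pattern being described - this is not Emgu's actual code, just a rough sketch of the same idea:)

```csharp
using System;
using System.Runtime.InteropServices;

// Rough sketch of a pinned buffer, mirroring the pattern described above.
class PinnedBuffer : IDisposable
{
    private readonly byte[] _data;
    private GCHandle _handle;

    public PinnedBuffer(int sizeInBytes)
    {
        _data = new byte[sizeInBytes];
        // Pinning prevents the GC from moving the array; many long-lived
        // pinned buffers are what encourage heap fragmentation.
        _handle = GCHandle.Alloc(_data, GCHandleType.Pinned);
        GC.AddMemoryPressure(sizeInBytes);
    }

    // Address that can be handed to native code such as GDI+.
    public IntPtr Address => _handle.AddrOfPinnedObject();

    public void Dispose()
    {
        _handle.Free();
        GC.RemoveMemoryPressure(_data.Length);
    }
}
```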
Whether that's causing the hard paging I don't know. But it seems likely.
In particular, the in-memory representation of the image could be specifically geared towards displaying it, as opposed to being the original file bytes. So if we're talking JPEGs, for example, then a 300 KB JPEG could be considerably larger in physical memory, depending on its dimensions. E.g. a 1024x768 32-bit image is 3 MB - and that's allocated twice for each image, since it's loaded (first allocation) then copied (second allocation) into the EMGU image object before being disposed.
But you have to ask yourself if it's necessary to find a way around the problem. If your application is not consuming vast amounts of physical RAM, then it will have much less of an impact on other applications; one process hitting the page file lots and lots won't badly affect another process that doesn't, if there's sufficient physical memory.
However, it is not always easy to determine which process is directly responsible for the paging or why.
The devil is in that cop-out note. Bitmaps are mapped into memory from the file that contains the pixel data using a memory-mapped file. That's an efficient way to avoid reading and writing the data directly into/from RAM, you only pay for what you use. The mechanism that keeps the file in sync with RAM is paging. So it is inevitable that if you process a lot of images then you'll see a lot of page faults. The tool you use just isn't smart enough to know that this is by design.
Feature, not a bug.
I have to create a C# program that deals well with reading in huge files.
For example, I have a 60+ MB file. I read all of it into a Scintilla box; let's call it sci_log. The program is using roughly 200 MB of memory with this and other features. This is still acceptable (and less than the amount of memory used by Notepad++ to open this file).
I have another Scintilla box, sci_splice. The user inputs a search term and the program searches through the file (or sci_log if the file is small enough - it doesn't matter, because it happens both ways) to find a regex match. When it finds a match, it concatenates that line with a string that holds previous matches and increases a temporary count variable. When count reaches 100 (or 150, or 200 - any number, really), I put the output in sci_splice, call GC.Collect(), and repeat for the next 100 lines (setting count = 0 and nulling the string).
I don't have the code on me right now as I'm writing this from my home laptop, but the issue is that it's using a LOT of memory. The 200 MB memory usage jumps up to well over 1 GB with no end in sight. This only happens on a search with a lot of regex matches, so it's something with the string. But wouldn't the GC free up that memory? Also, why does it go up so high? It doesn't make sense that it would more than triple (the worst possible case). Even if all of that 200 MB was just the log in memory, all it's doing is reading each line and storing it (at worst).
After some more testing, it looks like there's something wrong with Scintilla using a lot of memory when adding lines. The initial read of the lines has a memory spike up to 850 MB for a fraction of a second. Guess I need to just page the output.
Don't call GC.Collect. In this case I don't think it matters because I think this memory is going to end up on the Large Object Heap (LOH). But the point is .Net knows a lot more about memory management than you do; leave it alone.
I suspect you are looking at this using Task Manager, just by the way you are describing it. You need to instead use at least Perfmon. Anticipating that you have not used it before, go here and do pretty much what Tess does up to where it says Get a Memory Dump. Not sure you are ready for WinDbg, but that may be your next step.
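(If you'd rather sample the same counters from code than from the Perfmon UI, something along these lines should work; it's Windows-only and assumes the English counter names.)

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class PagingMonitor
{
    static void Main()
    {
        // System-wide paging rate; the "Memory" category has no instances.
        using (var pagesPerSec = new PerformanceCounter("Memory", "Pages/sec"))
        {
            for (int i = 0; i < 10; i++)
            {
                // The first NextValue() of a rate counter returns 0,
                // so sample repeatedly with a delay in between.
                Console.WriteLine($"Pages/sec: {pagesPerSec.NextValue():F1}");
                Thread.Sleep(1000);
            }
        }
    }
}
```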
Without seeing code there is almost no way to know what is going on. The problem could be inside Scintilla too, but I would check through what you are doing first. By running Perfmon you may at least get more information to figure out what to do next.
If you are using System.String to store your matching lines, I suggest you try replacing it with System.Text.StringBuilder and see if this makes any difference.
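(Something along these lines - the names are made up, it's just to show the shape of it:)

```csharp
using System.Text;
using System.Text.RegularExpressions;

static class MatchCollector
{
    // Accumulating matches in a StringBuilder avoids allocating a new,
    // ever-larger string on every concatenation.
    public static string CollectMatches(string[] lines, Regex pattern)
    {
        var buffer = new StringBuilder();
        foreach (string line in lines)
        {
            if (pattern.IsMatch(line))
                buffer.AppendLine(line);
        }
        return buffer.ToString();
    }
}
```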
Try http://msdn.microsoft.com/en-us/library/system.io.memorymappedfiles.memorymappedfile(VS.100).aspx
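(For example, something like this reads the file through a memory-mapped view instead of pulling everything into strings up front; the path is a placeholder.)

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

class MappedLogReader
{
    static void Main()
    {
        // Map the big log file and stream lines out of the view on demand.
        using (var mmf = MemoryMappedFile.CreateFromFile("huge.log", FileMode.Open))
        using (var view = mmf.CreateViewStream())
        using (var reader = new StreamReader(view))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Run the regex against each line here instead of
                // keeping the whole file in memory.
            }
        }
    }
}
```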