Shredding files in .NET

Is there an SDK that can be used in managed code to shred files securely?
EDIT: This is the only link I could find on Google that helps me.
EDIT: Either SDK or some kind of COM based component.

This code from CodeProject may be a good starting point.
Eraser has been around for years. You could call out to it using System.Diagnostics.Process, or at least review the algorithm there.

Take a look at Windows.WinAny.Helper on CodePlex. It has a SecureDelete extension that lets you shred files with different algorithms such as Gutmann, DoD-7, DoD-3, Random or Quick.

Technology has changed in the past few years so when I happened to see this answer (why wasn't an answer accepted again?) I wanted to provide an update for others with similar questions.
Please note that shredding is very much filesystem and media dependent. Attempting to "shred" a file on a log-based filesystem or a filesystem stored on smart (wear-leveling) flash isn't going to get you very far. You would have to, at a minimum, write enough data to completely fill the device to have any hope that the old data is overwritten even once.
More likely you would have to write several smaller files and, when the filesystem is full, delete one and keep writing a new one, to ensure that all reserved space has been overwritten as well. Then you will probably be fairly safe. Probably.
I say probably because the storage media or filesystem could decide that a block was failing (or had been used too much relative to others) and map it away, substituting some other part of the disk instead. This is a per-block thing of course, so a file much larger than one block is unlikely to be reconstructed in full.
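For media that really does overwrite in place (for example a classic hard disk with a non-journaling filesystem), the basic idea behind the tools mentioned above can be sketched as follows. This is a minimal sketch under that assumption, not a secure-deletion guarantee: the class and method names are made up for illustration, and none of the caveats above about flash, log-based filesystems or remapped blocks go away.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileShredder
{
    // Hypothetical helper: overwrites the file's current length with random
    // bytes `passes` times, then deletes it. Only meaningful if the storage
    // stack writes the new bytes over the old ones in place.
    public static void OverwriteAndDelete(string path, int passes)
    {
        long length = new FileInfo(path).Length;
        byte[] buffer = new byte[64 * 1024];

        using (RandomNumberGenerator rng = RandomNumberGenerator.Create())
        using (FileStream stream = new FileStream(path, FileMode.Open, FileAccess.Write,
                                                  FileShare.None, buffer.Length,
                                                  FileOptions.WriteThrough))
        {
            for (int pass = 0; pass < passes; pass++)
            {
                stream.Position = 0;
                long remaining = length;
                while (remaining > 0)
                {
                    int chunk = (int)Math.Min(buffer.Length, remaining);
                    rng.GetBytes(buffer);
                    stream.Write(buffer, 0, chunk);
                    remaining -= chunk;
                }
                // Ask the OS to push its own caches to the device as well.
                stream.Flush(true);
            }
        }

        File.Delete(path);
    }
}
```

Calling FileShredder.OverwriteAndDelete(path, 3) would make three random passes before deleting; more passes cost time and do nothing for data the device has already moved elsewhere.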

Related

A file for comments

I wrote an app in which there is a ton of comments.
This may be a bit unusual, but I would like to know if there is a way in Visual Studio to elegantly arrange the comments in a dedicated file. Or, is there a way to write text somewhere in a text file?
It is true that using comments is great, but my code is still congested.
Currently I plan to use a new class for comments, which will only contain comments with details on the parts of the code that are concerned.
If you have a better idea, thank you very much for sharing it.

I can't stress enough how much of a bad idea this is.
Code comments are best when they are:
Near the code they concern
Short and simple
Used sparingly: code often changes, comments can very quickly fall out of sync with it, and before you know it the comments are doing more harm than good.
If there really is some functional explanation you're trying to get across, e.g. why something is needed and how it works, rather than how to use it, I'd recommend writing a document to explain this.
There are all sorts of ways to do this:
Word documents on a shared system (e.g. a network drive / sharepoint)
A wiki system online / internally (e.g. Atlassian Confluence, or GitHub wiki)
(to name a couple)
As per other users' suggestions though, you should ensure that there aren't a lot of comments, as they just add noise (something you're clearly discovering).
Sidenote: I once worked for a company that insisted on using comments everywhere, every function had to have a banner comment with its name, signature, who wrote it and an edit history (even though we used source control), and nearly every line of code had to be commented to state what it was (supposedly) doing. If you're in a similar position, perhaps try to explain the problems this causes?

Creating a DSP system from scratch

I love electronic music and I am interested in how it all ticks.
I've found lots of helpful questions on Stack Overflow on libraries that can be used to play with audio, filters etc. But what I am really curious about is what is actually happening: how is the data being passed between effects and oscillators? I have done research into the mathematical side of DSP and I've got that end of the problem sussed, but I am unsure what buffering system to use etc. The final goal is to have a simple object hierarchy of effects and oscillators that pass the data between each other (maybe using multithreading if I don't end up pulling out all my hair trying to implement it). It's not going to be the next Propellerhead Reason, but I am interested in how it all works and this is more of an exercise than something that will yield an end product.
At the moment I use .net and C# and I have recently learnt F# (which may or may not lead to some interesting ways of handling the data) but if these are not suitable for the job I can learn another system if necessary.
The question is: what is the best way to get the large amounts of signal data through the program using buffers? For instance, would I be better off using a Queue, Array, Linked List etc.? Should I make the samples immutable and create a new set of data each time I apply an effect to the system, or just edit the values in the buffer? Should I have a dispatcher/thread pool style object that organises passing data, or should the effect functions pass data directly between each other?
Thanks.
EDIT: another related question is how would I then use the Windows API to play this array? I don't really want to use DirectShow because Microsoft has pretty much left it to die now.
EDIT2: thanks for all the answers. After looking at all the technologies I will either use XNA 4 (I spent a while trawling the internet and found this site which explains how to do it) or NAudio to output the music... not sure which one yet, it depends on how advanced the system ends up being. When C# 5.0 comes out I will use its async capabilities to create an effects architecture on top of that. I've pretty much used everybody's answer equally so now I have a conundrum of who to give the bounty to...

Have you looked at VST.NET (http://vstnet.codeplex.com/)? It's a library for writing VST plugins in C#, and it has some examples. You could also consider writing a VST, so that your code can be used from any host application (but even if you don't want to, looking at their code can be useful).
Signal data is usually big and requires a lot of processing. Do not use a linked list! Most libraries I know simply use an array to hold all the audio data (after all, that's what the sound card expects).
From a VST.NET sample:
public override void Process(VstAudioBuffer[] inChannels, VstAudioBuffer[] outChannels)
{
    VstAudioBuffer audioChannel = outChannels[0];
    for (int n = 0; n < audioChannel.SampleCount; n++)
    {
        audioChannel[n] = Delay.ProcessSample(inChannels[0][n]);
    }
}
The audioChannel is a wrapper around an unmanaged float* buffer.
You would probably store your samples in an immutable array. Then, when you want to play them, you copy the data into the output buffer (changing the frequency if you want) and perform effects in this buffer. Note that you can use several output buffers (or channels) and sum them at the end.
Edit
I know two low-level ways to play your array: DirectSound and WaveOut from the Windows API. C# Example using DirectSound. C# example with WaveOut. However, you might prefer to use an external higher-level library, like NAudio. NAudio is convenient for .NET audio manipulation - see this blog post for sending a sine wave to the audio card. You can see they are also using an array of float, which is what I recommend (if you do your computations using bytes, the rounding at every step adds a lot of quantization noise to the sound).

F# is probably a good choice here, as it's well suited to manipulating functions. Functions are probably good building blocks for signal creation and processing.
F# is also good at manipulating collections in general, and arrays in particular, thanks to the higher-order functions in the Array module.
These qualities make F# popular in the finance sector and are also useful for signal processing, I would guess.
Visual F# 2010 for Technical Computing has a section dedicated to Fourier Transform, which could be relevant to what you want to do. I guess there is plenty of free information about the transform on the net, though.
Finally, to play samples, you can use XNA. I think the latest version of the API (4.0) also allows recording, but I have never used that. There is a famous music editing app for the Xbox called ezmuse+ Hamst3r Edition that uses XNA, so it's definitely possible.

With respect to buffering and asynchrony/threading/synchronization issues, I suggest you take a look at the new TPL Dataflow library. With its block primitives, concurrent data structures, data flow networks, async message processing, and TPL's Task-based abstraction (which can be used with the async/await features of C# 5), it's a very good fit for this type of application.

I don't know if this is really what you're looking for, but this was one of my personal projects while in college. I didn't truly understand how sound and DSP worked until I implemented it myself. I was trying to get as close to the speaker as possible, so I did it using only libsndfile, to handle the file format intricacies for me.
Basically, my first project was to create a large array of doubles, fill it with a sine wave, then use sf_writef_double() to write that array to a file to create something that I could play, and see the result in a waveform editor.
Next, I added another function in between the sine call, and the write call, to add an effect.
This way you start playing with very low-level oscillators and effects, and you can see the results immediately. Plus, it's very little code to get something like this working.
Personally, I would start with the simplest possible solution you can, then slowly add on. Try just writing out to a file and using your audio player to play it, so you don't have to deal with the audio APIs. Just use a single array to start, and modify it in place. Definitely start off single-threaded. As your project grows, you can start moving to other solutions, like pipes instead of the array, multi-threading it, or working with the audio API.
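Since the question is about .NET, here is roughly what that first exercise could look like in C#. It is a sketch only: the answer above used libsndfile, whereas this writes a minimal 16-bit mono PCM WAV header by hand, and the class and method names are invented for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class SineToWav
{
    // Fills a new array with a sine wave of the given frequency.
    public static double[] Sine(double frequencyHz, double seconds, int sampleRate)
    {
        double[] samples = new double[(int)(seconds * sampleRate)];
        for (int n = 0; n < samples.Length; n++)
            samples[n] = Math.Sin(2.0 * Math.PI * frequencyHz * n / sampleRate);
        return samples;
    }

    // A trivial "effect" that modifies the array in place.
    public static void Gain(double[] samples, double factor)
    {
        for (int n = 0; n < samples.Length; n++)
            samples[n] *= factor;
    }

    // Writes mono 16-bit PCM. Samples are expected in [-1, 1] and are clamped.
    public static void WriteWav(string path, double[] samples, int sampleRate)
    {
        int dataBytes = samples.Length * 2;
        using (BinaryWriter w = new BinaryWriter(File.Create(path)))
        {
            w.Write(Encoding.ASCII.GetBytes("RIFF"));
            w.Write(36 + dataBytes);      // size of everything after this field
            w.Write(Encoding.ASCII.GetBytes("WAVE"));
            w.Write(Encoding.ASCII.GetBytes("fmt "));
            w.Write(16);                  // fmt chunk size
            w.Write((short)1);            // format tag: PCM
            w.Write((short)1);            // channels
            w.Write(sampleRate);
            w.Write(sampleRate * 2);      // bytes per second
            w.Write((short)2);            // bytes per sample frame
            w.Write((short)16);           // bits per sample
            w.Write(Encoding.ASCII.GetBytes("data"));
            w.Write(dataBytes);
            foreach (double s in samples)
            {
                double clamped = Math.Max(-1.0, Math.Min(1.0, s));
                w.Write((short)Math.Round(clamped * short.MaxValue));
            }
        }
    }
}
```

Generating one second at 440 Hz, calling Gain with 0.5 and then WriteWav gives a file you can open in any waveform editor; further effects slot in between the Sine call and the WriteWav call.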
If you're wanting to create a project you can ship, depending on exactly what it is, you'll probably have to move to more complex libraries, like some real-time audio processing. But the basics you learn by doing the simple way above will definitely help when you get to this point.
Good luck!

I've done quite a bit of real-time DSP, although not with audio. While either of your ideas (an immutable buffer, or a mutable buffer modified in place) could work, what I prefer to do is create a single permanent buffer for each link in the signal path. Most effects don't lend themselves well to modification in place, since each input sample affects multiple output samples. The buffer-for-each-link technique works especially well when you have resampling stages.
Here, when samples arrive, the first buffer is overwritten. Then the first filter reads the new data from its input buffer (the first buffer) and writes to its output (the second buffer). Then it invokes the second stage to read from the second buffer and write into the third.
This pattern completely eliminates dynamic allocation, allows each stage to keep a variable amount of history (since effects need some memory), and is very flexible as far as enabling rearranging the filters in the path.
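A minimal sketch of the buffer-per-link idea follows. All type names here are hypothetical, and for brevity a small Chain class drives the stages in order instead of each stage invoking the next, which is a simplification of what is described above. Buffers are allocated once; each stage reads one and writes the next.

```csharp
using System;

public interface IStage
{
    void Process(float[] input, float[] output, int count);
}

public sealed class GainStage : IStage
{
    private readonly float _gain;
    public GainStage(float gain) { _gain = gain; }

    public void Process(float[] input, float[] output, int count)
    {
        for (int i = 0; i < count; i++) output[i] = input[i] * _gain;
    }
}

// One-pole low-pass filter; keeps one sample of history between blocks.
public sealed class OnePoleLowPass : IStage
{
    private readonly float _a;
    private float _last;
    public OnePoleLowPass(float a) { _a = a; }

    public void Process(float[] input, float[] output, int count)
    {
        for (int i = 0; i < count; i++)
        {
            _last += _a * (input[i] - _last);
            output[i] = _last;
        }
    }
}

public sealed class Chain
{
    private readonly IStage[] _stages;
    // _buffers[i] is the input of stage i, _buffers[i + 1] is its output.
    private readonly float[][] _buffers;

    public Chain(int blockSize, params IStage[] stages)
    {
        _stages = stages;
        _buffers = new float[stages.Length + 1][];
        for (int i = 0; i < _buffers.Length; i++)
            _buffers[i] = new float[blockSize];
    }

    public float[] Input { get { return _buffers[0]; } }
    public float[] Output { get { return _buffers[_buffers.Length - 1]; } }

    // Runs every stage once over the first `count` samples; no allocation here.
    public void Process(int count)
    {
        for (int i = 0; i < _stages.Length; i++)
            _stages[i].Process(_buffers[i], _buffers[i + 1], count);
    }
}
```

Incoming samples are copied into Input, Process is called once per block, and the result is read from Output. Rearranging the path is then a matter of passing the stages to the constructor in a different order.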

Alright, I'll have a stab at the bounty as well then :)
I'm actually in a very similar situation. I've been making electronic music for ages, but only over the past couple of years I've started exploring actual audio processing.
You mention that you have researched the maths. I think that's crucial. I'm currently fighting my way through Ken Steiglitz' A Digital Signal Processing Primer - With Applications to Digital Audio and Computer Music. If you don't know your complex numbers and phasors it's going to be very difficult.
I'm a Linux guy so I've started writing LADSPA plugins in C. I think it's good to start at that basic level, to really understand what's going on. If I was on Windows I'd download the VST SDK from Steinberg and write a quick proof of concept plugin that just adds noise or whatever.
Another benefit of choosing a framework like VST or LADSPA is that you can immediately use your plugins in your normal audio suite. The satisfaction of applying your first home-built plugin to an audio track is unbeatable. Plus, you will be able to share your plugins with other musicians.
There are probably ways to do this in C#/F#, but I would recommend C++ if you plan to write VST plugins, just to avoid any unnecessary overhead. That seems to be the industry standard.
In terms of buffering, I've been using circular buffers (a good article here: http://www.dspguide.com/ch28/2.htm). A good exercise is to implement a finite impulse response (FIR) filter (what Steiglitz refers to as a feedforward filter) - these rely on buffering and are quite fun to play around with.
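As a rough illustration of that exercise, written in C# to match the question and not taken from my LADSPA code, a feedforward filter on top of a circular buffer might look like this. The class name is made up; the circular buffer simply holds the most recent inputs, one per coefficient.

```csharp
using System;

public sealed class FirFilter
{
    private readonly double[] _coefficients;
    private readonly double[] _history; // circular buffer of the most recent inputs
    private int _newest;                // index of the most recent sample

    public FirFilter(double[] coefficients)
    {
        if (coefficients == null || coefficients.Length == 0)
            throw new ArgumentException("At least one coefficient is required.");
        _coefficients = (double[])coefficients.Clone();
        _history = new double[coefficients.Length];
    }

    // y[n] = c[0]*x[n] + c[1]*x[n-1] + ... + c[N-1]*x[n-N+1]
    public double ProcessSample(double input)
    {
        // Advance the write position and overwrite the oldest sample.
        _newest = (_newest + 1) % _history.Length;
        _history[_newest] = input;

        double sum = 0.0;
        int index = _newest;
        for (int k = 0; k < _coefficients.Length; k++)
        {
            sum += _coefficients[k] * _history[index];
            index = (index == 0) ? _history.Length - 1 : index - 1; // step back in time
        }
        return sum;
    }
}
```

With coefficients { 0.5, 0.5 } this is a two-point moving average, which is about the simplest low-pass filter there is.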
I've got a repo on Github with a few very basic LADSPA plugins. The architectural difference aside, they could potentially be useful for someone writing VST plugins as well. https://github.com/andreasjansson/my_ladspa_plugins
Another good source of example code is the Csound project. There's tonnes of DSP code in there, and the software is aimed primarily at musicians.

Start with reading this and this.
This will give you an idea of WHAT you have to do.
Then, learn the DirectShow architecture - and learn HOW not to do it, but try to create your own simplified version of it.

You could have a look at BYOND. It is an environment for programmatic audio/MIDI instrument and effect creation in C#. It is available as a standalone application and as a VST instrument and effect.
FULL DISCLOSURE I am the developer of BYOND.

FASTEST directory listing

I have massive directories, and I would like to read all the files as fast as I can. I mean, not DirectoryInfo.GetFiles fast, but 'get-clusters-from-disk-low-level' fast.
Of course, .NET 2.0, C#.
A similar question was asked here, but that approach wasn't any good:
C# Directory listing massive directory
Someone suggested P/Invoke on FindFirst/FindNext. Has anybody tried that, and can you share the results?

For a "normal" approach, basically everything boils down to FindFirstFile/FindNextFile, you don't really get much faster than that... and that isn't super-turbo-fast.
If you really need speed, look into reading the MFT manually - but know that this requires admin privileges, and is prone to break whenever NTFS gets updated (and, oh yeah, won't work for non-NTFS filesystems). You might want to have a look at this code which has USN and MFT stuff.
However, perhaps there's a different solution. If your app is running constantly and needs to pick up changes, you can start off by doing one slow FindFirstFile/FindNextFile pass, and then use directory change notification support to be informed of updates... that works for limited users, and doesn't depend on filesystem structures.
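For reference, a FindFirstFile/FindNextFile pass through P/Invoke can be sketched like this. It is Windows-only and I make no claim about how much faster it is on your directories; the class and method names are invented, while the kernel32 functions and the WIN32_FIND_DATA layout are the documented Win32 ones. It sticks to C# 2.0 features.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.InteropServices;
using FILETIME = System.Runtime.InteropServices.ComTypes.FILETIME;

public static class FastDirectoryLister
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    private struct WIN32_FIND_DATA
    {
        public FileAttributes dwFileAttributes;
        public FILETIME ftCreationTime;
        public FILETIME ftLastAccessTime;
        public FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    private static readonly IntPtr InvalidHandle = new IntPtr(-1);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FindClose(IntPtr hFindFile);

    // Lazily yields the names of the files (not subdirectories) in one directory.
    public static IEnumerable<string> EnumerateFileNames(string directory)
    {
        WIN32_FIND_DATA data;
        IntPtr handle = FindFirstFile(Path.Combine(directory, "*"), out data);
        if (handle == InvalidHandle)
            yield break; // directory missing, empty or inaccessible

        try
        {
            do
            {
                bool isDirectory = (data.dwFileAttributes & FileAttributes.Directory) != 0;
                if (!isDirectory)
                    yield return data.cFileName;
            }
            while (FindNextFile(handle, out data));
        }
        finally
        {
            FindClose(handle);
        }
    }
}
```

Because the names are yielded as they are found, you can start processing before the whole directory has been read, which DirectoryInfo.GetFiles in .NET 2.0 does not allow.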

For the best performance, it is possible to P/Invoke NtQueryDirectoryFile, documented as ZwQueryDirectoryFile.
(That is, short of accessing the disk directly and reading the raw file system structures, which usually is not practical.)

Try something like this DirectoryManager and adapt it to your needs. It works faster than the .NET Framework's GetDirectories() or GetFiles() because it omits the cross-platform checks and adaptations.

C# solution for analysing files as they are written/modified

I have several projects that require me to monitor files, and then edit them as they are getting written to disk. I have a feeling that what I am looking for is operationally the same as how anti-virus tools operate. Let me give more details:
1) I need to trap all files saved by Office applications, and then add specific company tags to the headers/footers of each document as it is written to disk.
2) I need to know immediately when an editable file (of pretty much any type) is written to disk, so that I can run some scanning operations to check whether the file's content meets certain company policies.
In short, you can see that I need to process any user files as they are being written to disk.
Here is my problem. I want to use C# for this task, but I am not sure if it has the ability to meet my requirements. Everything I have seen on the net is geared towards lower-level C programming, which I specifically want to avoid due to time constraints for this project. Is anyone aware of how to do this easily in C#? Is it even feasible (i.e. is the language too high-level, too slow, etc.)?

Performance won't be the issue. I guess I'd question the entire process: it sounds like a recipe for disaster. You can easily hack something together in C# using a FileSystemWatcher in a matter of minutes, but it will be fraught with issues. AV software is bad enough about locking files and screwing up various software, and it's not even trying to modify the file. How do you know when the other app is "done" writing the file? What do you do when you've got the file locked and something else breaks because it can't get access?

Have you looked at the FileSystemWatcher?

C# can easily do this. Look at the FileSystemWatcher class (http://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.aspx).
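A minimal sketch of the FileSystemWatcher approach is below. The class and method names are made up, and the "is the writer done?" check is only a heuristic: it retries until the file can be opened with no sharing, which says nothing about whether the application will reopen the file a moment later.

```csharp
using System;
using System.IO;
using System.Threading;

public static class WriteMonitor
{
    // Returns true once the file can be opened exclusively, i.e. nobody else
    // currently has it open. Gives up after `attempts` tries.
    public static bool WaitForExclusiveAccess(string path, int attempts, int delayMs)
    {
        for (int i = 0; i < attempts; i++)
        {
            try
            {
                using (new FileStream(path, FileMode.Open, FileAccess.ReadWrite, FileShare.None))
                    return true;
            }
            catch (UnauthorizedAccessException)
            {
                return false; // a directory, or a file we may not touch
            }
            catch (IOException)
            {
                Thread.Sleep(delayMs); // still locked (or already gone); try again
            }
        }
        return false;
    }

    // Watches one directory and calls onFileReady when a created or changed
    // file can be opened exclusively. Dispose the returned watcher to stop.
    public static FileSystemWatcher Watch(string directory, Action<string> onFileReady)
    {
        FileSystemWatcher watcher = new FileSystemWatcher(directory);
        watcher.NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite;

        FileSystemEventHandler handler = delegate(object sender, FileSystemEventArgs e)
        {
            if (WaitForExclusiveAccess(e.FullPath, 10, 200))
                onFileReady(e.FullPath);
        };
        watcher.Created += handler;
        watcher.Changed += handler;
        watcher.EnableRaisingEvents = true;
        return watcher;
    }
}
```

Expect several Changed events for a single save, and note that anything you write back into the file from onFileReady will itself raise another event, so real code needs some way to recognise its own modifications.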

Binary patch-generation in C#

Does anyone have, or know of, a binary patch generation algorithm implementation in C#?
Basically, compare two files (designated old and new), and produce a patch file that can be used to upgrade the old file to have the same contents as the new file.
The implementation would have to be relatively fast, and work with huge files. It should exhibit O(n) or O(log n) runtimes.
My own algorithms tend to either be lousy (fast but produce huge patches) or slow (produce small patches but have O(n^2) runtime).
Any advice, or pointers for implementation would be nice.
Specifically, the implementation will be used to keep servers in sync for various large datafiles that we have one master server for. When the master server datafiles change, we need to update several off-site servers as well.
The most naive algorithm I have made, which only works for files that can be kept in memory, is as follows:
1. Grab the first four bytes from the old file, call this the key
2. Add those bytes to a dictionary, where key -> position, where position is the position where I grabbed those 4 bytes, 0 to begin with
3. Skip the first of these four bytes, grab another 4 (3 overlap, 1 new), and add to the dictionary the same way
4. Repeat steps 1-3 for all 4-byte blocks in the old file
5. From the start of the new file, grab 4 bytes, and attempt to look it up in the dictionary
6. If found, find the longest match if there are several, by comparing bytes from the two files
7. Encode a reference to that location in the old file, and skip the matched block in the new file
8. If not found, encode 1 byte from the new file, and skip it
9. Repeat steps 5-8 for the rest of the new file
This is somewhat like compression, without windowing, so it will use a lot of memory. It is, however, fairly fast, and produces quite small patches, as long as I try to make the codes output minimal.
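To make the steps concrete, here is a compact in-memory sketch of them. It is not my actual source: the type names are invented, the patch is kept as a list of operations instead of being encoded to bytes, and the dictionary stores every position per key so that the longest match can be chosen.

```csharp
using System;
using System.Collections.Generic;

public struct PatchOp
{
    public bool IsCopy;   // true: copy Length bytes from the old file at Offset
    public int Offset;
    public int Length;
    public byte Literal;  // used when IsCopy is false
}

public static class NaivePatch
{
    private const int KeySize = 4;

    private static uint KeyAt(byte[] data, int pos)
    {
        return unchecked((uint)(data[pos] | data[pos + 1] << 8 |
                                data[pos + 2] << 16 | data[pos + 3] << 24));
    }

    public static List<PatchOp> Create(byte[] oldData, byte[] newData)
    {
        // Index every overlapping 4-byte window of the old file by position.
        Dictionary<uint, List<int>> index = new Dictionary<uint, List<int>>();
        for (int i = 0; i + KeySize <= oldData.Length; i++)
        {
            uint key = KeyAt(oldData, i);
            List<int> positions;
            if (!index.TryGetValue(key, out positions))
                index[key] = positions = new List<int>();
            positions.Add(i);
        }

        // Walk the new file, emitting copies or single literal bytes.
        List<PatchOp> ops = new List<PatchOp>();
        int pos = 0;
        while (pos < newData.Length)
        {
            int bestOffset = -1, bestLength = 0;
            List<int> candidates;
            if (pos + KeySize <= newData.Length &&
                index.TryGetValue(KeyAt(newData, pos), out candidates))
            {
                foreach (int start in candidates)
                {
                    int length = 0;
                    while (start + length < oldData.Length &&
                           pos + length < newData.Length &&
                           oldData[start + length] == newData[pos + length])
                        length++;
                    if (length > bestLength) { bestLength = length; bestOffset = start; }
                }
            }

            if (bestLength >= KeySize)
            {
                ops.Add(new PatchOp { IsCopy = true, Offset = bestOffset, Length = bestLength });
                pos += bestLength;
            }
            else
            {
                ops.Add(new PatchOp { Literal = newData[pos] });
                pos++;
            }
        }
        return ops;
    }

    public static byte[] Apply(byte[] oldData, List<PatchOp> ops)
    {
        List<byte> output = new List<byte>();
        foreach (PatchOp op in ops)
        {
            if (op.IsCopy)
                for (int i = 0; i < op.Length; i++) output.Add(oldData[op.Offset + i]);
            else
                output.Add(op.Literal);
        }
        return output.ToArray();
    }
}
```

Written this way, highly repetitive input (long runs of the same byte, say) makes the candidate lists very long and the match search slow, and the dictionary costs far more memory than the old file itself.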
A more memory-efficient algorithm uses windowing, but produces much bigger patch files.
There are more nuances to the above algorithm that I skipped in this post, but I can post more details if necessary. I do, however, feel that I need a different algorithm altogether, so improving on the above algorithm is probably not going to get me far enough.
Edit #1: Here is a more detailed description of the above algorithm.
First, combine the two files, so that you have one big file. Remember the cut-point between the two files.
Secondly, do the "grab 4 bytes and add their position to the dictionary" step for everything in the whole file.
Thirdly, from where the new file starts, do the loop with attempting to locate an existing combination of 4 bytes, and find the longest match. Make sure we only consider positions from the old file, or from earlier in the new file than we're currently at. This ensures that we can reuse material in both the old and the new file during patch application.
Edit #2: Source code to the above algorithm
You might get a warning about the certificate having some problems. I don't know how to resolve that so for the time being just accept the certificate.
The source uses lots of other types from the rest of my library so that file isn't all it takes, but that's the algorithm implementation.
@lomaxx, I have tried to find good documentation for the algorithm used in Subversion, called xdelta, but unless you already know how the algorithm works, the documents I've found fail to tell me what I need to know.
Or perhaps I'm just dense... :)
I took a quick peek at the algorithm from the site you gave, and it is unfortunately not usable. A comment from the binary diff file says:
Finding an optimal set of differences requires quadratic time relative to the input size, so it becomes unusable very quickly.
My needs aren't optimal though, so I'm looking for a more practical solution.
Thanks for the answer though, added a bookmark to his utilities if I ever need them.
Edit #1: Note, I will look at his code to see if I can find some ideas, and I'll also send him an email later with questions, but I've read that book he references and though the solution is good for finding optimal solutions, it is impractical in use due to the time requirements.
Edit #2: I'll definitely hunt down the python xdelta implementation.

Sorry I couldn't be more help. I would definitely keep looking at xdelta, because I have used it a number of times to produce quality diffs on 600MB+ ISO files we have generated for distributing our products, and it performs very well.

bsdiff was designed to create very small patches for binary files. As stated on its page, it requires max(17*n,9*n+m)+O(1) bytes of memory and runs in O((n+m) log n) time (where n is the size of the old file and m is the size of the new file).
The original implementation is in C, but a C# port is described here and available here.

Have you seen VCDiff? It is part of a Misc library that appears to be fairly active (last release r259, April 23rd 2008). I haven't used it, but thought it was worth mentioning.

It might be worth checking out what some of the other guys are doing in this space, and not necessarily only in the C# arena.
This is a library written in C#.
SVN also has a binary diff algorithm, and I know there's an implementation in Python, although I couldn't find it with a quick search. They might give you some ideas on where to improve your own algorithm.

If this is for installation or distribution, have you considered using the Windows Installer SDK? It has the ability to patch binary files.
http://msdn.microsoft.com/en-us/library/aa370578(VS.85).aspx

This is a rough guideline, but the following is for the rsync algorithm which can be used to create your binary patches.
http://rsync.samba.org/tech_report/tech_report.html
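The part of that report most relevant to patch generation is the weak rolling checksum, which lets you slide a fixed-size window over a file one byte at a time without re-summing the whole window. A small C# sketch of it, with a made-up class name and with the strong per-block hash and the block matching left out, might look like this:

```csharp
using System;

public sealed class RollingChecksum
{
    private const uint M = 1 << 16;
    private readonly int _window;
    private uint _a; // sum of the bytes in the window, mod M
    private uint _b; // position-weighted sum of the bytes in the window, mod M

    // Computes the checksum of data[start .. start + window - 1] from scratch.
    public RollingChecksum(byte[] data, int start, int window)
    {
        _window = window;
        for (int i = 0; i < window; i++)
        {
            _a = (_a + data[start + i]) % M;
            _b = (_b + (uint)(window - i) * data[start + i]) % M;
        }
    }

    public uint Value { get { return _a | (_b << 16); } }

    // Slides the window one byte to the right in constant time.
    public void Roll(byte outgoing, byte incoming)
    {
        _a = (_a + M - outgoing + incoming) % M;
        uint removed = ((uint)_window * outgoing) % M;
        _b = (_b + M - removed + _a) % M;
    }
}
```

The receiver's blocks are indexed by this checksum, and the sender rolls it across its own file, only computing the expensive strong hash when the weak one matches.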
