Most efficient way to search for files - c#

I am writing a program that searches and copies mp3-files to a specified directory.
Currently I am using a List that is filled with all the mp3s in a directory (which, not surprisingly, takes a very long time). Then I use taglib-sharp to compare the ID3 tags with the artist and title entered. If they match, I copy the file.
Since this is my first program and I am very new to programming I figure there must be a better/more efficient way to do this. Does anybody have a suggestion on what I could try?
Edit: I forgot to add an important detail: I want to be able to specify which directories should be searched every time I start a search (the directory to be searched is specified in the program itself). So storing all the files in a database or something similar isn't really an option, unless there is a way to do that on every run that is still efficient. Basically, I am looking for the best way to search through all the files in a directory when the files are indexed on every run. (I am aware that this is probably not a good idea, but I'd like to do it that way. If there is no real way to do this I'll have to reconsider, but for now I'd like to stick with it.)

You are mostly saddled with the bottleneck that is IO, a consequence of the hardware you are working with. Copying the files will dominate the total time (finding the files is dwarfed by the copying).
There are other ways to go about file management, each exposing a better interface for a particular purpose, such as NTFS Change Journals and low-level sector handling (not recommended), but if this is your first program in C# you probably don't want to venture into P/Invoking native calls.
Beyond alternative approaches, you might consider mechanisms to minimise disk access - i.e. not redoing anything you have already done, or don't need to do.

Use a database (a simple binary-serialized file or an embedded database like RavenDB) to cache all files, and query that cache instead.
Also store modified time for each folder in the database. Compare the time in the database with the time on the folder each time you start your application (and sync changed folders).
That ought to give you much better performance. Threading will not really help searching folders since it's the disk IO that takes time, not your application.
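A minimal sketch of that caching idea, assuming a simple binary-serialized cache file rather than an embedded database (the FolderCache class, its Sync method, and the cache layout are hypothetical, not from the question):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Runtime.Serialization.Formatters.Binary;

    [Serializable]
    class FolderCache
    {
        // Folder path -> last write time observed when the folder was indexed.
        public Dictionary<string, DateTime> FolderTimes = new Dictionary<string, DateTime>();
        // Folder path -> mp3 files found in that folder (parsed tags could be cached here too).
        public Dictionary<string, List<string>> FolderFiles = new Dictionary<string, List<string>>();

        public static FolderCache Load(string cacheFile)
        {
            if (!File.Exists(cacheFile)) return new FolderCache();
            using (var stream = File.OpenRead(cacheFile))
                return (FolderCache)new BinaryFormatter().Deserialize(stream);
        }

        public void Save(string cacheFile)
        {
            using (var stream = File.Create(cacheFile))
                new BinaryFormatter().Serialize(stream, this);
        }

        // Re-index only the folders whose last write time changed since the cache was built.
        public void Sync(string root)
        {
            var folders = new List<string> { root };
            folders.AddRange(Directory.GetDirectories(root, "*", SearchOption.AllDirectories));

            foreach (var dir in folders)
            {
                DateTime lastWrite = Directory.GetLastWriteTimeUtc(dir);
                DateTime cached;
                if (!FolderTimes.TryGetValue(dir, out cached) || cached != lastWrite)
                {
                    FolderFiles[dir] = new List<string>(Directory.GetFiles(dir, "*.mp3"));
                    FolderTimes[dir] = lastWrite;
                }
            }
        }
    }

A folder's last write time only changes when files are added to or removed from that folder directly, which is why every folder (not just the root) gets its own entry.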

Related

Way to obtain full directory information in a batch

We’ve got a process that obtains a list of files from a remote transatlantic Samba share. This is naturally on the slow side; however, it’s made worse by the fact that we don’t just need the names of the files, we also need the last write times to spot updates. There are quite a lot of files in the directory, and as far as I can tell, the .NET file API insists on me asking for each one individually. Is there a faster way of obtaining the information we need?
I would love to find a way myself. I have exactly the same problem - huge number of files on a slow network location, and I need to scan for changes.
As far as I know, you do need to ask for file properties one by one.
The amount of information transferred per file should not be high, though; the round-trip request-response time is probably the main problem. You can help the situation by running multiple requests in parallel (e.g. using Parallel.ForEach).
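As a rough sketch of the Parallel.ForEach suggestion (the share path and the degree of parallelism are placeholders):

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class RemoteScanSketch
    {
        static void Main()
        {
            // Hypothetical UNC path to the slow remote share.
            string share = @"\\remote-server\share";
            var lastWriteTimes = new ConcurrentDictionary<string, DateTime>();

            // Overlap the per-file round-trips instead of paying for them one at a time.
            Parallel.ForEach(
                Directory.GetFiles(share),
                new ParallelOptions { MaxDegreeOfParallelism = 16 }, // tune for the link
                file => lastWriteTimes[file] = File.GetLastWriteTimeUtc(file));

            Console.WriteLine("Scanned {0} files", lastWriteTimes.Count);
        }
    }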
The answer to your question is most likely no, at least not in a meaningful way.
Exactly how you enumerate the files in your code is almost irrelevant since they all boil down to the same file system API in Windows. Unfortunately, there is no function that returns a list of file details in one call*.
So, no matter what your code looks like, somewhere below, it's still enumerating the directory contents and calling a particular file function individually for each file.
If this is really a problem, I would look into moving the detection logic closer to the files and send your app the results periodically.
*Disclaimer: It's been a long time since I've been down this far in the stack and I'm just browsing the API docs now, there may be a new function somewhere that does exactly this.

How to make subsequent instances of an assembly share the same memory?

I want something like a static class variable, except when different applications load my assembly I want them all to be sharing the same variable.
I know I could write to disk or to a database, but this is for a process that's used with SQL queries, and that would probably slow it down too much (actually I am going to test these options out, but I'm asking this question in the meantime because I don't think it's going to be an acceptable solution).
I would prefer the solution that incurs the least deployment overhead, and I don't mind if the solution isn't easy to create so long as it's easy to use when I'm done.
I'm aware that there are some persistent memory frameworks out there. I haven't checked any of them out yet and maybe one of them would be perfect so feel free to recommend one. I am also perfectly content to write something myself, particularly if it makes deployment easier for me to do so.
Thanks in advance for any and all suggestions!
Edit: Looks like I was overlooking a really easy solution. My problem involved SQL only providing 8000 bytes of space to serialize data between calls to a SQL aggregate function I wrote. I read an article on how to compress your data and get the most out of those 8000 bytes, and assumed there was nothing more I could do. As it turns out, I can set MaxBytes = -1 instead of a value between 0 and 8000 to get up to 2 GB of space. I believe this is something new they added in the 3.5 framework, because there are various articles out there talking about this 8000-byte limitation.
Thank you all for your answers, though, as this is a problem I've wanted to solve for other reasons in the past, and now I know what to do if I need a really easy and fast way to communicate between apps.
You can't store this as in-memory data and have it shared between processes, since each process has its own isolated memory address space.
One option, however, would be to use the .NET Memory-mapped file support to "store" the shared data. This would allow you to write a file that contained the information in a place that every process could access.
Each process has its own address space. You cannot simply share a variable the way you intend to.
You can use shared memory though.
If you are on .NET 4, you can simply use Memory-Mapped Files
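A minimal sketch of the memory-mapped file approach (the map name "MyAssemblySharedState" and the 1 KB capacity are arbitrary placeholders):

    using System;
    using System.IO.MemoryMappedFiles;

    class SharedStateSketch
    {
        static void Main()
        {
            // A named, non-persisted memory-mapped file is visible to every process
            // on the machine that opens the same name.
            using (var mmf = MemoryMappedFile.CreateOrOpen("MyAssemblySharedState", 1024))
            using (var accessor = mmf.CreateViewAccessor())
            {
                int value = accessor.ReadInt32(0);   // read the shared value at offset 0
                accessor.Write(0, value + 1);        // update it for other processes to see
                Console.WriteLine("Shared value is now {0}", value + 1);
            }
        }
    }

Note that a non-persisted map only lives as long as at least one process keeps it open, and concurrent readers and writers still need synchronization, which is where the named objects below come in.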
If you want some sort of machine-wide count or locking, you can look into using named synchronization objects like semaphores (http://msdn.microsoft.com/en-us/library/z6zx288a.aspx) or mutexes (http://msdn.microsoft.com/en-us/library/hw29w7t1.aspx). When a name is specified, such objects are machine-wide instead of process-wide.
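For example, a named mutex (the name Global\MyAssemblyLock is just a placeholder) gives you a machine-wide lock around the shared state:

    using System;
    using System.Threading;

    class MachineWideLockSketch
    {
        static void Main()
        {
            // Every process that opens a mutex with this name contends for the same lock;
            // the "Global\" prefix makes it visible across all sessions on the machine.
            using (var mutex = new Mutex(false, @"Global\MyAssemblyLock"))
            {
                mutex.WaitOne();
                try
                {
                    // ... read/write the shared resource (e.g. the memory-mapped file above) ...
                    Console.WriteLine("Holding the machine-wide lock");
                }
                finally
                {
                    mutex.ReleaseMutex();
                }
            }
        }
    }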

Secure wipe a directory

I know how to wipe a file in C#, including its sectors and such.
But how do I overwrite the directories themselves?
Example: #"C:\mydirectory\" must be unrecoverable gone forever (all files insides are already wiped) so that it will be impossible to recover the directory structure or their names.
------------------- Update below (comment formatting is such a hassle so I post it here)---
For the file deletion I look up the partition's cluster and sector sizes and overwrite the file at least 40 times using 5 different algorithms, where the last algorithm is always the random one. The data is also actually written to the disk each time (and not just held in memory or a cache). The only risk is that by the time I wipe something, the physical address of that file on the disk could theoretically have changed. The only solution I know for that is to also wipe the free disk space after the file has been wiped, and hope that no other file currently occupies part of the old physical location of the wiped file. Or does Windows not do such a thing?
http://www.sans.org/reading_room/whitepapers/incident/secure-file-deletion-fact-fiction_631 states:
"It is important to note the consensus that overwriting the data only reduces the
likelihood of data being recovered. The more times data is overwritten, the more
expensive and time consuming it becomes to recover the data. In fact Peter Guttman
states “…it is effectively impossible to sanitize storage locations by simple overwriting
them, no matter how many overwrite passes are made or what data patterns are written.”3 Overwritten data can be recovered using magnetic force microscopy, which
deals with imaging magnetization patterns on the platters of the hard disk. The actual
details of how this is accomplished are beyond the scope of this paper."
Personally I believe that when I overwrite the data 100+ times using different (possibly unknown) algorithms, and there is no copy of the data left elsewhere (like in the swap files), it will take even a very expensive team of professionals many, many years to get that data back. And if they do get the data back after all those years, then I guess they deserve it... That must be a project for life.
So:
- wiping unused data: use cipher (http://support.microsoft.com/kb/315672), fill the hard disk with 4 GB files, or use the Eraser command line executable.
- wiping swap files: ?
- wiping bad sectors: ?
- wiping directories: use Eraser (as Teoman Soygul stated)
How do we know for sure that we overwrote the actual physical addresses?
- wiping the most recently used files and the Windows log files: should of course be a piece of cake for any programmer :)
Eraser solves most of the above problems but cannot wipe the page files. So any forensic investigator could still recover the data if it was in those swap files at any moment.
AFAIK Eraser does not wipe the file allocation tables, but I'm not sure.
And the conclusion should then be: it's (nearly) impossible to do a secure wipe in C#?
There is no general approach for this... consider an SSD: you can't even make sure that your write operation will write to the same physical address, because of wear-levelling.
If all files/folders inside the folder are already wiped (as you stated), all that is left is the directory itself. Rename the directory using a cryptographic random number generator and then delete it. It will be as good as wiped.
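A rough sketch of that rename-then-delete idea (WipeDirectoryName is a hypothetical helper; note that whether the old name is actually overwritten on disk, rather than just replaced in a new directory entry, depends on the filesystem, so treat this as best-effort):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    class DirectoryWipeSketch
    {
        // Rename the directory to a random name before deleting it, so the name that
        // ends up in the freed directory entry is noise rather than the original.
        static void WipeDirectoryName(string path)
        {
            var bytes = new byte[16];
            using (var rng = RandomNumberGenerator.Create())
                rng.GetBytes(bytes);

            string randomName = BitConverter.ToString(bytes).Replace("-", "");
            string parent = Path.GetDirectoryName(path.TrimEnd(Path.DirectorySeparatorChar));
            string renamed = Path.Combine(parent, randomName);

            Directory.Move(path, renamed);   // replace the stored name with random characters
            Directory.Delete(renamed);       // the directory must already be empty/wiped
        }
    }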
If this isn't enough for you, grab a copy of Eraser command line executable and execute the command:
Process.Start("eraserl.exe", #"-folder "C:\MyDirectory\" -subfolders -method DoD_E -silent");
Securely deleting is not straightforward, as you know. So it may be worth considering an alternative strategy.
Have you considered using something like TrueCrypt to create an encrypted volume? You could store the files there, then use standard delete routines. An adversary would then need to both decrypt the encrypted volume AND recover the deleted files.

C# Is there a faster way than FileInfo.Length with multiple files comparison?

I have been trying to compare files from two different environments. Both environments are accessed via a network drive.
First, I check if the file is present in both environments (to save time), then I ask for a FileInfo on both files to compare their file sizes (FileInfo.Length).
My goal is to populate a ListView with every file that does not have the same size, so I can investigate these later.
I find it hard to understand that Windows Explorer gets the file size of a lot of files so quickly while FileInfo takes so long...
Thanks.
Ben
If you use DirectoryInfo.GetFiles(some-search-pattern) to get the files and then use the FileInfo instances returned from that call, the FileInfo.Length property will already be cached from the search.
Whether this will be of any help obviously depends on how the search performs in comparison to the check you are currently doing (if the search is slower, you may not gain anything). Still, it may be worth looking into.
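A small sketch of that approach, comparing the Length values that come back from the directory enumeration itself (the name-based matching between the two environments is an assumption about your setup):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class SizeCompareSketch
    {
        // Yields the names of files that exist in both directories but differ in size,
        // using the Length values already filled in by DirectoryInfo.GetFiles.
        static IEnumerable<string> FilesWithDifferentSize(string pathA, string pathB)
        {
            var sizesB = new DirectoryInfo(pathB).GetFiles()
                            .ToDictionary(f => f.Name, f => f.Length);

            foreach (FileInfo fileA in new DirectoryInfo(pathA).GetFiles())
            {
                long sizeB;
                if (sizesB.TryGetValue(fileA.Name, out sizeB) && sizeB != fileA.Length)
                    yield return fileA.Name;
            }
        }
    }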
You can try doing it the operating system way, via Win32 system calls. You can start with this page:
http://www.pinvoke.net/default.aspx/kernel32.getfilesize
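If you do go the Win32 route, an alternative to calling GetFileSize per handle is to P/Invoke FindFirstFile/FindNextFile, which return each file's size as part of the enumeration itself. A hedged sketch (standard pinvoke.net-style declarations, a single directory, no error handling):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Runtime.InteropServices;

    class Win32SizeSketch
    {
        [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
        struct WIN32_FIND_DATA
        {
            public FileAttributes dwFileAttributes;
            public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
            public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
            public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
            public uint nFileSizeHigh;
            public uint nFileSizeLow;
            public uint dwReserved0;
            public uint dwReserved1;
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
            public string cFileName;
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
            public string cAlternateFileName;
        }

        [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
        static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA data);

        [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
        static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA data);

        [DllImport("kernel32.dll", SetLastError = true)]
        static extern bool FindClose(IntPtr hFindFile);

        static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

        // Enumerates one directory and reports each file's size without opening file handles.
        static IEnumerable<KeyValuePair<string, long>> EnumerateSizes(string directory)
        {
            WIN32_FIND_DATA data;
            IntPtr handle = FindFirstFile(Path.Combine(directory, "*"), out data);
            if (handle == INVALID_HANDLE_VALUE)
                yield break;
            try
            {
                do
                {
                    if ((data.dwFileAttributes & FileAttributes.Directory) == 0)
                    {
                        long size = ((long)data.nFileSizeHigh << 32) | data.nFileSizeLow;
                        yield return new KeyValuePair<string, long>(data.cFileName, size);
                    }
                } while (FindNextFile(handle, out data));
            }
            finally
            {
                FindClose(handle);
            }
        }
    }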

Synchronizing filesystem and cached data on program startup

I have a program that needs to retrieve some data about a set of files (that is, a directory and all files within it and subdirectories of certain types). The data is (very) expensive to calculate, so rather than traversing the filesystem and calculating it on program startup, I keep a cache of the data in a SQLite database and use a FileSystemWatcher to monitor changes to the filesystem. This works great while the program is running, but the question is how to refresh/synchronize the data during program startup. If files have been added (or changed -- I presume I can detect this via last modified/size), the data needs to be recomputed in the cache, and if files have been removed, the data needs to be removed from the cache (since the interface traverses the cache instead of the filesystem).
So the question is: what's a good algorithm to do this? One way I can think of is to traverse the filesystem and gather the path and last modified/size of all files in a dictionary. Then I go through the entire list in the database. If there is not a match, then I delete the item from the database/cache. If there is a match, then I delete the item from the dictionary. Then the dictionary contains all the items whose data needs to be refreshed. This might work, however it seems it would be fairly memory-intensive and time-consuming to perform on every startup, so I was wondering if anyone had better ideas?
If it matters: the program is Windows-only written in C# on .NET CLR 3.5, using the SQLite for ADO.NET thing which is being accessed via the entity framework/LINQ for ADO.NET.
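A sketch of the dictionary-diff algorithm described above, with the database access abstracted behind delegates (the names CachedFile, removeFromCache, and recompute are hypothetical):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class CacheSyncSketch
    {
        // Hypothetical shape of a record kept in the cache database for each file.
        class CachedFile { public string Path; public DateTime LastWrite; public long Length; }

        static void Synchronize(string root, List<CachedFile> cachedFiles,
                                Action<CachedFile> removeFromCache, Action<string> recompute)
        {
            // 1. Snapshot the filesystem: path -> (last write time, size).
            var onDisk = new Dictionary<string, FileInfo>(StringComparer.OrdinalIgnoreCase);
            foreach (var path in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
                onDisk[path] = new FileInfo(path);

            // 2. Walk the cache: drop deleted/stale entries, and forget unchanged files.
            foreach (var cached in cachedFiles)
            {
                FileInfo info;
                if (!onDisk.TryGetValue(cached.Path, out info))
                    removeFromCache(cached);                     // file was deleted
                else if (info.LastWriteTimeUtc == cached.LastWrite && info.Length == cached.Length)
                    onDisk.Remove(cached.Path);                  // unchanged, nothing to do
                else
                    removeFromCache(cached);                     // stale, recomputed below
            }

            // 3. Whatever is still in the snapshot is new or modified and needs recomputing.
            foreach (var path in onDisk.Keys)
                recompute(path);
        }
    }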
Our application is cross-platform C++ desktop application, but has very similar requirements. Here's a high-level description of what I did:
In our SQLite database there is a Files table that stores file_id, name, hash (currently we use last modified date as the hash value) and state.
Every other record refers back to a file_id. This makes it easy to remove "dirty" records when the file changes.
Our procedure for checking the filesystem and refreshing the cache is split into several distinct steps to make things easier to test and to give us more flexibility as to when the caching occurs (names like Walker, Loader, Scrubber, and Deleter below are just what I happened to pick for class names):
On 1st Launch
The database is empty. The Walker recursively walks the filesystem and adds the entries into the Files table. The state is set to UNPARSED.
Next, the Loader iterates through the Files table looking for UNPARSED files. These are handed off to the Parser (which does the actual parsing and inserting of data).
This takes a while, so 1st launch can be a bit slow.
There's a big testability benefit because you can test the filesystem-walking code independently from the loading/parsing code. On subsequent launches the situation is a little more complicated:
n+1 Launch
The Scrubber iterates over the Files table and looks for files that have been deleted and files that have been modified. It sets the state to DIRTY if the file exists but has been modified or DELETED if the file no longer exists.
The Deleter (not the most original name) then iterates over the Files table looking for DIRTY and DELETED files. It deletes the other related records (related via the file_id). Once the related records are removed, the original File record is either deleted or set back to state=UNPARSED.
The Walker then walks the filesystem to pick up new files.
Finally, the Loader loads all UNPARSED files.
Currently the "worst case scenario" (every file changes) is very rare - so we do this every time the application starts-up. But by splitting the process up unto these steps we could easily extend the implementation to:
The Scrubber/Deleter could be refactored to leave the dirty records in-place until after the new
data is loaded (so the application "keeps working" while new data is cached into the database)
The Loader could load/parse on a background thread during an idle time in the main application
If you know something about the data files ahead of time you could assign a 'weight' to the files and load/parse the really-important files immediately and queue-up the less-important files for processing at a later time.
Just some thoughts / suggestions. Hope they help!
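Translated into the question's language, the state machine might look roughly like this (C# rather than C++, with the storage layer abstracted away; the enum values and method shapes are assumptions based on the description above):

    using System;
    using System.Collections.Generic;
    using System.IO;

    // States used by the pipeline described above; naming follows the answer.
    enum FileState { Unparsed, Parsed, Dirty, Deleted }

    class FileRecord
    {
        public long FileId;
        public string Name;
        public DateTime Hash;      // last modified date is used as the "hash" value
        public FileState State;
    }

    class StartupPipelineSketch
    {
        // Scrubber: mark cached records whose files have changed or disappeared.
        static void Scrub(IEnumerable<FileRecord> files)
        {
            foreach (var record in files)
            {
                if (!File.Exists(record.Name))
                    record.State = FileState.Deleted;
                else if (File.GetLastWriteTimeUtc(record.Name) != record.Hash)
                    record.State = FileState.Dirty;
            }
        }

        // Deleter: purge records related to dirty/deleted files, then reset or remove them.
        static void Delete(List<FileRecord> files, Action<long> deleteRelatedRecords)
        {
            foreach (var record in new List<FileRecord>(files))
            {
                if (record.State == FileState.Dirty || record.State == FileState.Deleted)
                {
                    deleteRelatedRecords(record.FileId);
                    if (record.State == FileState.Deleted)
                        files.Remove(record);
                    else
                        record.State = FileState.Unparsed;   // the Loader will re-parse it
                }
            }
        }
    }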
Windows has a change journal mechanism, which does what you want: you subscribe to changes in some part of the filesystem and upon startup can read a list of changes which happened since last time you read them. See: http://msdn.microsoft.com/en-us/library/aa363798(VS.85).aspx
EDIT: I think it requires rather high privileges, unfortunately
The first obvious thing that comes to mind is creating a separate small application that would always run (as a service, perhaps) and create a kind of "log" of changes in the file system (no need to work with SQLite, just write them to a file). Then, when the main application starts, it can look at the log and know exactly what has changed (don't forget to clear the log afterwards :-).
However, if that is unacceptable to you for some reason, let us try to look at the original problem.
First of all, you have to accept that, in the worst case scenario, when all the files have changed, you will need to traverse the whole tree. And that may (although not necessarily will) take a long time. Once you realize that, you have to think about doing the job in background, without blocking the application.
Second, if you have to make a decision about each file that only you know how to make, there is probably no other way than going through all files.
Putting the above in other words, you might say that the problem is inherently complex (and any given problem cannot be solved with an algorithm that is simpler than the problem itself).
Therefore, your only hope is reducing the search space by using tweaks and hacks. And I have two of those on my mind.
First, it's better to query the database separately for every file instead of building a dictionary of all files first. If you create an index on the file path column in your database, it should be quicker, and of course, less memory-intensive.
Second, you don't actually have to query the database at all :-)
Just store the exact time when your application was last running somewhere (in a .settings file?) and check every file to see if it's newer than that time. If it is, you know it's changed. If it's not, you know you already caught its change last time (with your FileSystemWatcher).
Hope this helps. Have fun.
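A tiny sketch of that last-run-time trick (the stamp file name and location are placeholders; a .settings property would work just as well):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class LastRunCheckSketch
    {
        // Hypothetical file holding the time the application last ran.
        const string StampFile = "lastrun.stamp";

        static IEnumerable<string> FilesChangedSinceLastRun(string root)
        {
            DateTime lastRun = File.Exists(StampFile)
                ? DateTime.FromBinary(long.Parse(File.ReadAllText(StampFile)))
                : DateTime.MinValue;

            foreach (var path in Directory.GetFiles(root, "*", SearchOption.AllDirectories))
            {
                if (File.GetLastWriteTimeUtc(path) > lastRun)
                    yield return path;   // changed (or created) while the application was closed
            }
        }

        static void RecordRunTime()
        {
            File.WriteAllText(StampFile, DateTime.UtcNow.ToBinary().ToString());
        }
    }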
