I have built a small app that allows me to choose a directory and count the total size of the files in that directory and its subdirectories.
It allows me to select a drive, which populates a tree control with the drive's immediate folders, and I can then count the size of any of them.
It is written in .NET and simply loops over the directories, adding up the file sizes for each one.
It brings my PC to a halt when it runs on, say, the Windows or Program Files folders.
I had thought of multi-threading, but I haven't done this before.
Any ideas to increase performance?
thanks
Your code is really going to slog since you're just using strings to refer to directories and files. Use a DirectoryInfo on your root directory; get a list of FileSystemInfos from that one using DirectoryInfo.GetFileSystemInfos(); iterate on that list, recursing in for DirectoryInfo objects and just adding the size for FileInfo objects. That should be a LOT faster.
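A rough sketch of that recursion, using a hypothetical GetSize helper (FileSystemInfo items are always either FileInfo or DirectoryInfo, so the cast is safe):
using System.IO;

static long GetSize(DirectoryInfo root)
{
    long total = 0;
    foreach (FileSystemInfo item in root.GetFileSystemInfos())
    {
        FileInfo file = item as FileInfo;
        if (file != null)
            total += file.Length;                  // a file: add its size
        else
            total += GetSize((DirectoryInfo)item); // a sub-directory: recurse into it
    }
    return total;
}
Note that on folders like Windows you will still hit directories you can't read, so a try/catch for UnauthorizedAccessException around the recursion is worth adding.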
I'd simply suggest using a BackgroundWorker to perform the work. You'll probably want to disable any controls that shouldn't be used while it runs, but everything else can stay usable.
Google: http://www.google.com/search?q=background+worker
This would allow your application to be multi-threaded without some of the complexity of multiple threads. Everything has been packaged up and is convenient to use.
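A minimal sketch of what that can look like, assuming a GetDirectorySize method and hypothetical control names (scanButton, resultLabel, selectedPath):
using System.ComponentModel;

var worker = new BackgroundWorker();
worker.DoWork += (s, e) => e.Result = GetDirectorySize((string)e.Argument);  // runs on a thread-pool thread
worker.RunWorkerCompleted += (s, e) =>
{
    resultLabel.Text = e.Result.ToString();  // back on the UI thread here
    scanButton.Enabled = true;               // re-enable whatever was disabled for the scan
};
scanButton.Enabled = false;
worker.RunWorkerAsync(selectedPath);         // the folder picked in the tree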
Do you want to increase performance or increase system responsiveness?
You can increase RESPONSIVENESS by instructing the spidering application to run its message queue loop periodically, which handles screen repaints, etc. This would allow you to give a progress update as it executes the scan, while actually decreasing performance (because you're yielding CPU priority).
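For example, something along these lines inside the scan loop (a sketch only; path, totalSize and progressLabel are placeholders, and a real scan would also need to handle inaccessible sub-folders):
using System.IO;
using System.Windows.Forms;

int processed = 0;
foreach (string file in Directory.EnumerateFiles(path, "*", SearchOption.AllDirectories))
{
    totalSize += new FileInfo(file).Length;
    if (++processed % 500 == 0)
    {
        progressLabel.Text = processed + " files scanned";
        Application.DoEvents();   // pump the message queue: repaints and input get handled
    }
}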
This gets sub-directories:
string[] directories = Directory.GetDirectories(node.FullPath);
foreach (string dir in directories)
{
    TreeNode nd = node.Nodes.Add(dir, dir.Substring(dir.LastIndexOf("\\")).Replace("\\", ""), 3);
    if (showItsChildren)
        ShowChildDirectories(nd, true);
    size += GetDirectorySize(nd.FullPath);
}
This counts file sizes:
long b = 0;
// Get array of all file names.
string[] a = Directory.GetFiles(p, "*.*");
// Calculate total bytes of all files in a loop.
foreach (string name in a)
{
    // Use FileInfo to get length of each file.
    FileInfo info = new FileInfo(name);
    b += info.Length;
    IncrementCount();
}
Try commenting out all the parts that update the UI. If it's still slow, it's the disk I/O and there's not much you can do; if it gets faster, you can update the UI only every X files to save UI work.
You can make your UI responsive by doing all the work in a worker thread, but it will make it slightly slower.
Disk I/O is relatively slow, and the disk is often needed by other applications as well (swap file, temp files, ...). Multi-threading won't help much either: all the files are on the same physical disk, and it's likely that disk I/O is the bottleneck.
Just a guess, but I bet your performance hit involves the UI and not the file scan. Comment out the code that creates the TreeNode.
Try not to let your tree paint until after you complete your scan:
Make sure that the root tree node for all of your files is NOT added to the tree. Add all the children, and then add the "top" node/nodes at the very end of your processing. See how that works.
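If you prefer to keep the tree visible while you work, TreeView.BeginUpdate/EndUpdate gives a similar effect by suspending painting; a small sketch (treeView1, BuildDirectoryNodes and selectedPath are placeholders):
treeView1.BeginUpdate();                                    // suspend painting while nodes are added
try
{
    TreeNode rootNode = BuildDirectoryNodes(selectedPath);  // build the whole subtree detached
    treeView1.Nodes.Add(rootNode);                          // attach the root last, as suggested above
}
finally
{
    treeView1.EndUpdate();                                  // repaint once, after everything is in place
}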
Related
Presently working on an application that allows the user to input a list of names/search terms and a folder path. The application then searches for each phrase and copies any responsive documents to an output path. It's important to note that it is often used on directories containing from 100 GB up to a few TB and can sometimes be required to run thousands of search terms.
Initially I simply used the System.IO.Directory.GetFiles() function for this, but I've found I get better results by building a data table of all documents in the input path and running my searches over that data table (see below).
//Constructing a data table of all files in the input path
foreach (var file in fileArray)
{
    System.Data.DataRow row = searchTable.NewRow();
    row[1] = file;
    row[0] = System.IO.Path.GetFileName(file);
    searchTable.Rows.Add(row);
}
//For each line inputted by the user, search the data table to find any responsive file names
foreach (var line in searchArray)
{
    for (int i = 0; i < searchTable.Rows.Count; i++)
    {
        if (searchTable.Rows[i][0].ToString().Contains(line))
        {
            string file = searchTable.Rows[i][1].ToString();
            string output = SwiftBank.CalculateOutputFilePath(outputPath, inputPath, file);
            System.IO.File.Copy(file, output);
        }
    }
}
I've found that while this works, it isn't optimised and runs very slowly for large data sets. It's obviously doing a lot of repeat work, searching the full data table for every search term. Wondering if someone on here has a better idea?
In my experience, doing a handful of Contains queries over a few thousand fairly short strings should take less than a second. If you are searching inside much larger data sets, like searching through 100 GB of content, you should look at a more advanced library, like Lucene.
There are a few things I would suggest changing:
Use a regular list instead of a DataTable. Something like List<(string filePath, string fileName)> would be much simpler and contain the same information.
Perform all checks for a specific file at once, i.e. reorder your loops so the file loop is the outer one (see the sketch after this list). This should help cache usage a little bit.
However, the vast majority of the time will likely be spent on copying files. This is many orders of magnitude slower than doing some simple searching in a few kilobytes of memory. You might gain a little bit by doing more than one copy in parallel, since SSDs may be able to improve throughput at higher loads, but that is likely only true if the files are small. You might consider alternatives to the copying, like adding shortcuts, instead.
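A sketch of the first two suggestions combined; it reuses fileArray, searchArray, inputPath, outputPath and SwiftBank.CalculateOutputFilePath from the question, and stops after the first matching term so the same file isn't copied twice:
using System.Collections.Generic;
using System.IO;

var files = new List<(string filePath, string fileName)>();
foreach (var file in fileArray)
    files.Add((file, Path.GetFileName(file)));

// Outer loop over files, inner loop over search terms.
foreach (var (filePath, fileName) in files)
{
    foreach (var term in searchArray)
    {
        if (fileName.Contains(term))
        {
            string output = SwiftBank.CalculateOutputFilePath(outputPath, inputPath, filePath);
            File.Copy(filePath, output);
            break;   // one matching term is enough for this file
        }
    }
}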
I'm trying to locate a line which contains specific text inside a large text file (18 MB). Currently I'm using StreamReader to open the file and read it line by line, checking whether it contains the search string:
while ((line = reader.ReadLine()) != null)
{
    if (line.Contains("search string"))
    {
        // Do something with line
    }
}
But unfortunately, because the file I'm using has more than 1 million records, this method is slow. What is the quickest way to achieve this?
In general, disk IO of this nature is just going to be slow. There is likely little you can do to improve over your current version in terms of performance, at least not without dramatically changing the format in which you store your data, or your hardware.
However, you could shorten the code and simplify it in terms of maintenance and readability:
var lines = File.ReadLines(filename).Where(l => l.Contains("search string"));
foreach (var line in lines)
{
    // Do something here with line
}
Reading the entire file into memory causes the application to hang and is very slow. Do you think there is any other alternative?
If the main goal here is to prevent application hangs, you can do this in the background instead of in a UI thread. If you make your method async, this can become:
while ((line = await reader.ReadLineAsync()) != null)
{
    if (line.Contains("search string"))
    {
        // Do something with line
    }
}
This will likely make the total operation take longer, but not block your UI thread while the file access is occurring.
Get a hard drive with a faster read speed (moving to a solid state drive if you aren't already would likely help a lot).
Store the data across several files each on different physical drives. Search through those drives in parallel.
Use a RAID0 hard drive configuration. (This is sort of a special case of the previous approach.)
Create an index of the lines in the file that you can use to search for specific words. (Creating the index will be a lot more expensive than a single search, and will require a lot of disk space, but it will allow subsequent searches at much faster speeds.)
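For the last option, a very rough in-memory sketch of the idea; it only finds whole-word matches (unlike Contains), and filename and "searchword" are placeholders:
using System;
using System.Collections.Generic;
using System.IO;

// Build the index once: each word maps to the line numbers it appears on.
var index = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
int lineNumber = 0;
foreach (string line in File.ReadLines(filename))
{
    foreach (string word in line.Split(' ', '\t', ','))
    {
        List<int> hits;
        if (!index.TryGetValue(word, out hits))
            index[word] = hits = new List<int>();
        hits.Add(lineNumber);
    }
    lineNumber++;
}

// Later searches become dictionary lookups instead of full file scans.
List<int> matches;
if (index.TryGetValue("searchword", out matches))
    Console.WriteLine("Found on {0} line(s)", matches.Count);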
Environment: any .NET Framework version is welcome.
I have a log file that gets written to 24/7.
I am trying to create an application that will read the log file and process the data.
What's the best way to read the log file efficiently? I imagine monitoring the file with something like FileSystemWatcher. But how do I make sure I don't read the same data once it's been processed by my application? Or say the application aborts for some unknown reason, how would it pick up where it left off last?
There's usually a header and footer around the payload that's in the log file. Maybe an id field in the content as well. Not sure yet though about the id field being there.
I also imagined saving the count of lines read somewhere and using that as a bookmark.
For obvious reasons, reading the whole content of the file, as well as removing lines from the log file (after loading them into your application), is out of the question.
What I can think of as a partial solution is having a small database (probably something much smaller than a full-blown MySQL/MS SQL/PostgreSQL instance) and populating a table with what has been read from the log file. I am pretty sure that even if the power is cut and the machine is booted again, most relational databases will be able to restore their state with ease. This solution requires some data that can be used to identify the row from the log file (for example: the exact time of the action logged, the machine on which the action took place, etc.).
Well, you will have to figure out the magic for your particular case yourself. If you are using a well-known text encoding it may be pretty simple, though. Look at System.IO.StreamReader, its ReadLine() and DiscardBufferedData() methods, and its BaseStream property. You should be able to remember your last position in the file and rewind to that position later to start reading again, given that you are sure the file is only ever appended to. There are other things to consider, though, and there is no single universal answer to this.
Just as a naive example (you may still need to adjust a lot to make it work):
using System;
using System.IO;
using System.Text;

static void Main(string[] args)
{
    string filePath = @"c:\log.txt";
    using (var stream = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var streamReader = new StreamReader(stream, Encoding.Unicode))
    {
        long pos = 0;
        if (File.Exists(@"c:\log.txt.lastposition"))
        {
            string strPos = File.ReadAllText(@"c:\log.txt.lastposition");
            pos = Convert.ToInt64(strPos);
        }
        streamReader.BaseStream.Seek(pos, SeekOrigin.Begin); // rewind to the last saved position
        streamReader.DiscardBufferedData();                  // clear the buffer so reading starts at the new position
        for (;;)
        {
            string line = streamReader.ReadLine();
            if (line == null) break;
            ProcessLine(line);
        }
        // When everything is read, the stream position is at the end of the file.
        File.WriteAllText(@"c:\log.txt.lastposition", streamReader.BaseStream.Position.ToString());
    }
}
I think you will find the File.ReadLines(filename) function, in conjunction with LINQ, very handy for something like this. ReadAllLines() loads the entire text file into memory as a string[] array, but ReadLines lets you start enumerating the lines immediately as it traverses the file. This not only saves you time but keeps memory usage very low, as each line is processed one at a time. The using statements are important because if this program is interrupted they will close the file streams, flushing the writer and saving the unwritten content to the file. Then when it starts up again it will skip all the lines that have already been read.
int readCount = File.ReadLines("readLogs.txt").Count();
using (FileStream readLogs = new FileStream("readLogs.txt", FileMode.Append))
using (StreamWriter writer = new StreamWriter(readLogs))
{
    IEnumerable<string> lines = File.ReadLines("bigLogFile.txt").Skip(readCount);
    foreach (string line in lines)
    {
        // do something with line, or batch them if you need more than one
        writer.WriteLine(line);
    }
}
As MaciekTalaska mentioned, I would strongly recommend using a database if this is something written to 24/7 and will get quite large. File systems are simply not equipped to handle such volume and you will spend a lot of time trying to invent solutions where a database could do it in a breeze.
Is there a reason why it logs to a file? Files are great because they are simple to use and, being the lowest common denominator, there is relatively little that can go wrong. However, files are limited. As you say, there's no guarantee a write to the file will be complete when you read the file. Multiple applications writing to the log can interfere with each other. There is no easy sorting or filtering mechanism. Log files can grow very big very quickly and there's no easy way to move old events (say those more than 24 hours old) into separate files for backup and retention.
Instead, I would consider writing the logs to a database. The table structure can be very simple, but you get the advantage of transactions (so you can extract or back up with ease) and of searching, sorting and filtering using an almost universally understood syntax. If you are worried about load spikes, use a message queue, like http://msdn.microsoft.com/en-us/library/ms190495.aspx for SQL Server.
To make the transition easier, consider using a logging framework like log4net. It abstracts much of this away from your code.
Another alternative is to use a system like syslog or, if you have multiple servers and a large volume of logs, flume. By moving the log files away from the source computer, you can store them or inspect them on a different machine far more effectively. However, these are probably overkill for your current problem.
I have a list of large text files to process. I wonder which is the fastest method, because reading line by line is slow.
I have something like that:
int cnt = this.listView1.Items.Count;
for (int i = 0; i < this.listView1.Items.Count; i++)
{
    FileStream fs = new FileStream(this.listView1.Items[i].Text.ToString(), FileMode.Open, FileAccess.Read);
    using (StreamReader reader = new StreamReader(fs))
    {
        while (reader.Peek() != -1)
        {
            //code part
        }
    }
}
I read that processing blocks (like 100k lines each) via BackgroundWorkers with multiple threads would help, but I don't know how to implement it. If you have better ideas to improve the performance, your expert advice would be appreciated.
First you need to decide what your bottleneck is - I/O (reading the files) or CPU (processing them). If it's I/O, reading multiple files concurrently is not going to help you much; the most you can achieve is to have one thread read files and another process them. The processing thread will be done before the next file is available.
I agree with @asawyer: if it's only 100 MB, you should read the file entirely into memory in one swoop. You might as well read 5 of them entirely into memory, it's really not a big deal.
EDIT: After realizing all the files are on a single hard-drive, and that processing takes longer than reading the file.
You should have one thread reading the files sequentially. Once a file is read, fire up another thread that handles the processing, and start reading the second file in the first thread. Once the second file is read, fire up another thread, and so on (see the sketch below).
You should make sure you don't fire more processing threads than the number of cores you have, but for starters just use the thread pool for this, and optimize later.
You're missing a little bit of performance, because the time spent reading the first file is not used for any processing. This should be negligible; reading 100 MB of data into memory shouldn't take more than a few seconds.
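A minimal sketch of that read-then-hand-off pattern (filePaths and Process are placeholders):
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

var processing = new List<Task>();
foreach (string path in filePaths)
{
    string data = File.ReadAllText(path);              // read sequentially on this thread
    processing.Add(Task.Run(() => Process(data)));     // hand the processing to the thread pool
}
Task.WaitAll(processing.ToArray());                    // wait for the last files to finish processing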
I assume that you are processing files line by line. You also said that loading the files is faster than processing them. There are a few ways you can do what you need. One, for example:
Create a thread that reads the files one by one, line by line. Sequentially, because doing this in parallel will only hammer your HDD and possibly give worse results. You can use a Queue<string> for that. Use Queue.Enqueue() to add the lines you've read.
Run another thread that processes the queue. Use Queue.Dequeue() to get (and remove) lines from the beginning of the queue. Process the line and write it to the output file. Eventually you can put processed lines in another queue or list and write them all at once when you finish processing. A rough sketch of this split follows below.
If order of lines in output file is not important you can create as many threads as you have CPU cores (or use ThreadPool class) to do the processing (that would speed up things significantly).
[Edit]
If order of lines in the output file is important you should limit line processing to one thread. Or process them in parallel using separate threads and implement mechanism that would control output order. For example you may do that by numbering lines you read from input file (the easy way) or processing lines by each thread in chunks of n-lines and writing output chunk by chunk in the same order you started processing threads.
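Here is a rough sketch of that reader/worker split, using BlockingCollection<string> instead of a bare Queue<string> so the hand-off and "no more lines" signalling are handled for you (filePaths and ProcessLine are placeholders, and output order is unspecified here):
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

var lines = new BlockingCollection<string>(boundedCapacity: 10000);  // bounded, so the reader can't run far ahead

// One producer reads the files sequentially, line by line.
var reader = Task.Run(() =>
{
    foreach (string path in filePaths)
        foreach (string line in File.ReadLines(path))
            lines.Add(line);
    lines.CompleteAdding();                  // tell the consumers no more lines are coming
});

// One consumer per core takes lines off the queue and processes them.
var workers = new Task[Environment.ProcessorCount];
for (int i = 0; i < workers.Length; i++)
    workers[i] = Task.Run(() =>
    {
        foreach (string line in lines.GetConsumingEnumerable())
            ProcessLine(line);
    });

Task.WaitAll(workers);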
Here is some simple threading code you can use (.NET 4):
//first get the file paths from the ListView so you won't block the UI thread
List<string> filesPaths = new List<string>();
for (int i = 0; i < this.listView1.Items.Count; i++)
{
    filesPaths.Add(listView1.Items[i].Text.ToString());
}

//this loop will read up to 50 files in parallel
Parallel.ForEach(filesPaths, new ParallelOptions() { MaxDegreeOfParallelism = 50 }, (filepath, state, index) =>
{
    //read the file contents
    string data = File.ReadAllText(filepath);
    //do whatever you want with the contents
});
not tested though...
Any idea how to easily support file search patterns in your software, like **, *, ?
For example, subfolder/**/?svn searches all levels of subfolder for files/folders ending in "svn", 4 characters in total.
full description: http://nant.sourceforge.net/release/latest/help/types/fileset.html
If you load the directory as a DirectoryInfo, e.g.
DirectoryInfo directory = new DirectoryInfo(folder);
then do a search for files like this
IEnumerable<FileInfo> fileInfo = directory.GetFiles("*.svn", SearchOption.AllDirectories);
this should get you a list of FileInfo objects which you can manipulate.
To get all subdirectories you can do the same, e.g.
IEnumerable<DirectoryInfo> dirInfo = directory.GetDirectories("*svn", SearchOption.AllDirectories);
Anyway, that should give an idea of how I'd do it. Also, because fileInfo and dirInfo are IEnumerable, you can add LINQ Where queries etc. to filter the results.
A mix of regex and recursion should do the trick.
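For the regex half, here is one rough way to turn a pattern like subfolder/**/?svn into a Regex; this is a simplification of the full NAnt fileset rules, and the method name is mine:
using System.Text.RegularExpressions;

static Regex GlobToRegex(string pattern)
{
    // Escape everything first, then put the wildcard semantics back in.
    string escaped = Regex.Escape(pattern)
        .Replace(@"\*\*/", "(.*/)?")   // "**/" spans zero or more directory levels
        .Replace(@"\*\*", ".*")        // a bare "**" matches anything
        .Replace(@"\*", "[^/]*")       // "*" stays within one path segment
        .Replace(@"\?", "[^/]");       // "?" matches exactly one character
    return new Regex("^" + escaped + "$", RegexOptions.IgnoreCase);
}

// Usage: normalise paths to forward slashes, then test relative paths while recursing:
// bool match = GlobToRegex("subfolder/**/?svn").IsMatch(relativePath.Replace('\\', '/'));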
Another trick might be to spawn a thread for every folder or set of folders and have each thread proceed checking one more level down. This could speed up the process a bit.
The reason I say this is that checking folders is a highly I/O-bound process, so many threads will allow you to submit more disk requests faster, thus improving the speed.
This might sound silly, but have you considered downloading the nant source code to see how they did it?