I'm not really into multithreading so probably the question is stupid but it seems I cannot find a way to solve this problem (especially because I'm using C# and I've been using it for a month).
I have a dynamic number of directories (I got it from a query in the DB). Inside those queries there are a certain amount of files.
For each directory I need to use a method to transfer these files using FTP in a cuncurrent way because I have basically no limit in FTP max connections (not my word, it's written in the specifics).
But I still need to control the max amount of files transfered per directory. So I need to count the files I'm transfering (increment/decrement).
How could I do it? Should I use something like an array and use the Monitor class?
Edit: Framework 3.5
You can use the Semaphore class to throttle the number of concurrent files per directory. You would probably want to have one semaphore per directory so that the number of FTP uploads per directory can be controlled independently.
public class Example
{
public void ProcessAllFilesAsync()
{
var semaphores = new Dictionary<string, Semaphore>();
foreach (string filePath in GetFiles())
{
string filePathCapture = filePath; // Needed to perform the closure correctly.
string directoryPath = Path.GetDirectoryName(filePath);
if (!semaphores.ContainsKey(directoryPath))
{
int allowed = NUM_OF_CONCURRENT_OPERATIONS;
semaphores.Add(directoryPath, new Semaphore(allowed, allowed));
}
var semaphore = semaphores[directoryPath];
ThreadPool.QueueUserWorkItem(
(state) =>
{
semaphore.WaitOne();
try
{
DoFtpOperation(filePathCapture);
}
finally
{
semaphore.Release();
}
}, null);
}
}
}
var allDirectories = db.GetAllDirectories();
foreach(var directoryPath in allDirectories)
{
DirectoryInfo directories = new DirectoryInfo(directoryPath);
//Loop through every file in that Directory
foreach(var fileInDir in directories.GetFiles()) {
//Check if we have reached our max limit
if (numberFTPConnections == MAXFTPCONNECTIONS){
Thread.Sleep(1000);
}
//code to copy to FTP
//This can be Aync, when then transfer is completed
//decrement the numberFTPConnections so then next file can be transfered.
}
}
You can try something along the lines above. Note that It's just the basic logic and there are proberly better ways to do this.
Related
I am connecting my application with stock market live data provider using web socket. So when market is live and socket is open then it's giving me nearly 45000 lines in a minute. at a time I am deserializing it line by line
and then write that line into text file and also reading text file and removing first line of text file. So handling another process with socket becomes slow. So please can you help me that how should I perform that process very fast like nearly 25000 lines in a minute.
string filePath = #"D:\Aggregate_Minute_AAPL.txt";
var records = (from line in File.ReadLines(filePath).AsParallel()
select line);
List<string> str = records.ToList();
str.ForEach(x =>
{
string result = x;
result = result.TrimStart('[').TrimEnd(']');
var jsonString = Newtonsoft.Json.JsonConvert.DeserializeObject<List<LiveAMData>>(x);
foreach (var item in jsonString)
{
string value = "";
string dirPath = #"D:\COMB1\MinuteAggregates";
string[] fileNames = null;
fileNames = System.IO.Directory.GetFiles(dirPath, item.sym+"_*.txt", System.IO.SearchOption.AllDirectories);
if(fileNames.Length > 0)
{
string _fileName = fileNames[0];
var lineList = System.IO.File.ReadAllLines(_fileName).ToList();
lineList.RemoveAt(0);
var _item = lineList[lineList.Count - 1];
if (!_item.Contains(item.sym))
{
lineList.RemoveAt(lineList.Count - 1);
}
System.IO.File.WriteAllLines((_fileName), lineList.ToArray());
value = $"{item.sym},{item.s},{item.o},{item.h},{item.c},{item.l},{item.v}{Environment.NewLine}";
using (System.IO.StreamWriter sw = System.IO.File.AppendText(_fileName))
{
sw.Write(value);
}
}
}
});
How to make process fast, if application perform this then it takes nearly 3000 to 4000 symbols. and if there is no any process then it executes 25000 lines per minute. So how to increase line execution time/process with all this code ?
First you need to cleanup you code to gain more visibility, i did a quick refactor and this is what i got
const string FilePath = #"D:\Aggregate_Minute_AAPL.txt";
class SomeClass
{
public string Sym { get; set; }
public string Other { get; set; }
}
private void Something() {
File
.ReadLines(FilePath)
.AsParallel()
.Select(x => x.TrimStart('[').TrimEnd(']'))
.Select(JsonConvert.DeserializeObject<List<SomeClass>>)
.ForAll(WriteRecord);
}
private const string DirPath = #"D:\COMB1\MinuteAggregates";
private const string Separator = #",";
private void WriteRecord(List<SomeClass> data)
{
foreach (var item in data)
{
var fileNames = Directory
.GetFiles(DirPath, item.Sym+"_*.txt", SearchOption.AllDirectories);
foreach (var fileName in fileNames)
{
var fileLines = File.ReadAllLines(fileName)
.Skip(1).ToList();
var lastLine = fileLines.Last();
if (!lastLine.Contains(item.Sym))
{
fileLines.RemoveAt(fileLines.Count - 1);
}
fileLines.Add(
new StringBuilder()
.Append(item.Sym)
.Append(Separator)
.Append(item.Other)
.Append(Environment.NewLine)
.ToString()
);
File.WriteAllLines(fileName, fileLines);
}
}
}
From here should be more easy to play with List.AsParallel to check how and with what parameters the code is faster.
Also:
You are opening the write file twice
The removes are also somewhat expensive, in the index 0 is more (however, if there are few elements this could not make much difference
if(fileNames.Length > 0) is useless, use a for, if the list is empty, then he for will simply skip
You can try StringBuilder instead string interpolation
I hope this hints can help you to improve your time! and that i have not forgetting something.
Edit
We have nearly 10,000 files in our directory. So when process is
running then it's passing an error that The Process can not access the
file because it is being used by another process
Well, is there a possibility that in your process lines there is duplicated file names?
If that is the case, you could try a simple approach, a retry after some milliseconds, something like
private const int SleepMillis = 5;
private const int MaxRetries = 3;
public void WriteFile(string fileName, string[] fileLines, int retries = 0)
{
try
{
File.WriteAllLines(fileName, fileLines);
}
catch(Exception e) //Catch the special type if you can
{
if (retries >= MaxRetries)
{
Console.WriteLine("Too many tries with no success");
throw; // rethrow exception
}
Thread.Sleep(SleepMillis);
WriteFile(fileName, fileLines, ++retries); // try again
}
}
I tried to keep it simple, but there are some annotations:
- If you can make your methods async, it could be an improvement by changing the sleep for a Task.Delay, but you need to know and understand well how async works
- If the collision happens a lot, then you should try another approach, something like a concurrent map with semaphores
Second edit
In real scenario I am connecting to websocket and receiving 70,000 to
1 lac records on every minute and after that I am bifurcating those
records with live streaming data and storing in it's own file. And
that becomes slower when I am applying our concept with 11,000 files
It is a hard problem, from what i understand, you're talking about 1166 records per second, at this size the little details can become big bottlenecks.
At that phase i think it is better to think about other solutions, it could be so much I/O for the disk, could be many threads, or too few, network...
You should start by profiling the app to check where the app is spending more time to focus in that area, how much resources is using? how much resources do you have? how is the memory, processor, garbage collector, network? do you have an SSD?
You need a clear view of what is slowing you down so you can attack that directly, it will depend on a lot of things, it will be hard to help with that part :(.
There are tons of tools for profile c# apps, and many ways to attack this problem (spread the charge in several servers, use something like redis to save data really quick, some event store so you can use events....
I have made an implementation of File/Folder lookup in Google Drive v3 using their own Google API for .NET.
Code works fine, but to be honest I'm not sure if it's really on a standard-efficient way of doing this.
Logic:
I have to get to each and every folder and download specific files on it.
Structure can be A > B > C > D , basically a folder within a folder within a folder and so on.
I can't use a static-predefined directory schema as a long term solution as it can change anytime the owner wants to modify it, for now, the folders are at least 4 levels deep.
The only way I can navigate to the subfolders is to get its own Google Drive ID and use that to see its contents. It is like you need a KEY first before you can unlock/open the next subfolder.
In short, I can't do LOOKUP on subfolders content the easy way, unless there's someone that can give us better alternatives, I'll be glad to take any criticisms on my aproach and open to all your suggestions.
Thank you.
Update
Thank you all for providing links and examples,
I believed Recursion solution is the best so far on my current scenario,
And also, since this is a heavy IO Operations, I did apply ASYNC operations for downloading files and to the rest as possible, so I made sure to follow the ASYNC ALL THE WAY rule to prevent blocking.
To Call the Recursion Method
var parentID = "<folder id>";
var folderLevel = 0;
var listRequest = service.Files.List();
await MyTask(listRequest, id, count, folderLevel);
This is the Recursion Method, it will search all possible folders from the set root parent id that was defined...
private async Task RecursionTask(ListRequest listRequest, string parentId, int count, int folderLevel)
{
// This method do the Folder search
listRequest.Q = $"('{parentId}' in parents) and (mimeType = 'application/vnd.google-apps.folder') and trashed = false";
listRequest.Fields = "files(id,name)";
var filesTask = await listRequest.ExecuteAsync();
var files = filesTask.Files;
count = files.Count(); // Keep track of recursion flow
count--;
// Keep track of how deep recursion is diving on subfolders
folderLevel++;
var tasks = new List<Task>();
foreach(var file in files)
{
tasks.Add(InnerTask(file, listRequest, name, folderLevel)); // Create Array Of Tasks for IMAGE Search
if (count > 1) // Loop until I exhausted the value of count
{
// Return recursion flow
await RecursionTask(listRequest, file.Name, file.Id, count, folderLevel);
}
}
await Task.WhenAll(tasks); // Wait all tasks to finish
}
This is the innerTask that will handle the Downloading of drive files
private async Task InnerTask(File file, ListRequest listRequest, string name,int folderLevel)
{
// This method do the IMAGE SEARCH
listRequest.Q = $"('{file.Id}' in parents) and (mimeType = 'image/jpeg' or mimeType = 'image/png')";
listRequest.Fields = "files(id,name)";
var subFiles = await listRequest.ExecuteAsync();
foreach (var subFile in subFiles.Files)
{
// Do Async task for downloading images on Google Drive
}
}
Currently I have a .txt file of about 170,000 jpg file names and I read them all into a List (fileNames).
I want to search ONE folder (this folder has sub-folders) to check if each file in fileNames exists in this folder and if it does, copy it to a new folder.
I was making a rough estimate but each search and copy for each file name in fileNames takes about .5 seconds. So 170,000 seconds is roughly 48 hours so divide by 2 that will take about 24 hours for my app to have searched for every single file name using 1 thread! Obviously this is too long so I want to narrow this down and speed the process up. What is the best way to go about doing this using multi-threading?
Currently I was thinking of making 20 separate threads and splitting my list (fileNames) into 20 different lists and search for the files simultaneously. For example I would have 20 different threads executing the below at the same time:
foreach (string str in fileNames)
{
foreach (var file in Directory.GetFiles(folderToCheckForFileName, str, SearchOption.AllDirectories))
{
string combinedPath = Path.Combine(newTargetDirectory, Path.GetFileName(file));
if (!File.Exists(combinedPath))
{
File.Copy(file, combinedPath);
}
}
}
UPDATED TO SHOW MY SOLUTION BELOW:
string[] folderToCheckForFileNames = Directory.GetFiles("C:\\Users\\Alex\\Desktop\\ok", "*.jpg", SearchOption.AllDirectories);
foreach(string str in fileNames)
{
Parallel.ForEach(folderToCheckForFileNames, currentFile =>
{
string filename = Path.GetFileName(currentFile);
if (str == filename)
{
string combinedPath = Path.Combine(targetDir, filename);
if (!File.Exists(combinedPath))
{
File.Copy(currentFile, combinedPath);
Console.WriteLine("FOUND A MATCH AND COPIED" + currentFile);
}
}
}
);
}
Thank you everyone for your contributions! Greatly Appreciated!
Instead of using ordinary foreach statement in doing your search, you should use parallel linq. Parallel linq combines the simplicity and readability of LINQ syntax with the power of parallel programming. Just like code that targets the Task Parallel Library. This will shield you from low level thread manipulation and probable exceptions (hard to find/debug exceptions) while splitting your work among many threads. So you might do something like this:
fileNames.AsParallel().ForAll(str =>
{
var files = Directory.GetFiles(folderToCheckForFileName, str, SearchOption.AllDirectories);
files.AsParallel().ForAll(file =>
{
if (!string.IsNullOrEmpty(file))
{
string combinedPath = Path.Combine(newTargetDirectory, Path.GetFileName(file));
if (!File.Exists(combinedPath))
{
File.Copy(file, combinedPath);
}
}
});
});
20 different threads won't help if your computer has fewer than 20 cores. In fact, it can make the process slower because you will 1) have to spend time context switching between each thread (which is your CPU's way of emulating more than 1 thread / core) and 2) a Thread in .NET reserves 1 MB for its stack, which is pretty hefty.
Instead, try dividing your I/O into async workloads, using Task.Run for the CPU-bound / intensive parts. Also, keep your number of Tasks to maybe 4 to 8 at the max.
Sample code:
var tasks = new Task[8];
var names = fileNames.ToArray();
for (int i = 0; i < tasks.Length; i++)
{
int index = i;
tasks[i] = Task.Run(() =>
{
for (int current = index; current < names.Length; current += 8)
{
// execute the workload
string str = names[current];
foreach (var file in Directory.GetFiles(folderToCheckForFileName, str, SearchOption.AllDirectories))
{
string combinedPath = Path.Combine(newTargetDirectory, Path.GetFileName(file));
if (!File.Exists(combinedPath))
{
File.Copy(file, combinedPath);
}
}
}
});
}
Task.WaitAll(tasks);
Following is the code which processes about 10000 files.
var files = Directory.GetFiles(directorypath, "*.*", SearchOption.AllDirectories).Where(
name => !name.EndsWith(".gif") && !name.EndsWith(".jpg") && !name.EndsWith(".png")).ToList();
Parallel.ForEach(files,Countnumberofwordsineachfile);
And the Countnumberofwordsineachfile function prints the number of words in each file into the text.
Whenever i implement Parallel.ForEach(), i miss about 4-5 files everytime while processing.
Can anyone suggest as to why this happens?
public void Countnumberofwordsineachfile(string filepath)
{
string[] arrwordsinfile = Regex.Split(File.ReadAllText(filepath).Trim(), #"\s+");
Charactercount = Convert.ToInt32(arrwordsinfile.Length);
filecontent.AppendLine(filepath + "=" + Charactercount);
}
fileContent is probably not threadsafe. So if two (or more) tasks attempt to append to it at the same time one will win, the other will not. You need to remember to either lock the sections that are shared, or don't used shared data.
This is probably the easiest solution for your code. Locking, synchronises access (other tasks have to queue up to access the locked section) so it will slow down the algorithm, but since this is very short compared to the part that counts the words is likely to be then it isn't really going to be much of an issue.
private object myLock = new object();
public void Countnumberofwordsineachfile(string filepath)
{
string[] arrwordsinfile = Regex.Split(File.ReadAllText(filepath).Trim(), #"\s+");
Charactercount = Convert.ToInt32(arrwordsinfile.Length);
lock(myLock)
{
filecontent.AppendLine(filepath + "=" + Charactercount);
}
}
The cause has already been found, here is an alternative implementation:
//Parallel.ForEach(files,Countnumberofwordsineachfile);
var fileContent = files
.AsParallel()
.Select(f=> f + "=" + Countnumberofwordsineachfile(f));
and that requires a more useful design for the count method:
// make this an 'int' function, more reusable as well
public int Countnumberofwordsineachfile(string filepath)
{ ...; return characterCount; }
But do note that going parallel won't help you much here, your main function (ReadAllText) is I/O bound so you will most likely see a degradation from using AsParallel().
The better option is to use Directory.EnumerateFiles and then collect the results without parallelism:
var files = Directory.EnumerateFiles(....);
var fileContent = files
//.AsParallel()
.Select(f=> f + "=" + Countnumberofwordsineachfile(f));
I want to be able to get the size of one of the local directories using C#. I'm trying to avoid the following (pseudo like code), although in the worst case scenario I will have to settle for this:
int GetSize(Directory)
{
int Size = 0;
foreach ( File in Directory )
{
FileInfo fInfo of File;
Size += fInfo.Size;
}
foreach ( SubDirectory in Directory )
{
Size += GetSize(SubDirectory);
}
return Size;
}
Basically, is there a Walk() available somewhere so that I can walk through the directory tree? Which would save the recursion of going through each sub-directory.
A very succinct way to get a folder size in .net 4.0 is below. It still suffers from the limitation of having to traverse all files recursively, but it doesn't load a potentially huge array of filenames, and it's only two lines of code. Make sure to use the namespaces System.IO and System.Linq.
private static long GetDirectorySize(string folderPath)
{
DirectoryInfo di = new DirectoryInfo(folderPath);
return di.EnumerateFiles("*.*", SearchOption.AllDirectories).Sum(fi => fi.Length);
}
If you use Directory.GetFiles you can do a recursive seach (using SearchOption.AllDirectories), but this is a bit flaky anyway (especially if you don't have access to one of the sub-directories) - and might involve a huge single array coming back (warning klaxon...).
I'd be happy with the recursion approach unless I could show (via profiling) a bottleneck; and then I'd probably switch to (single-level) Directory.GetFiles, using a Queue<string> to emulate recursion.
Note that .NET 4.0 introduces some enumerator-based file/directory listing methods which save on the big arrays.
Here my .NET 4.0 approach
public static long GetFileSizeSumFromDirectory(string searchDirectory)
{
var files = Directory.EnumerateFiles(searchDirectory);
// get the sizeof all files in the current directory
var currentSize = (from file in files let fileInfo = new FileInfo(file) select fileInfo.Length).Sum();
var directories = Directory.EnumerateDirectories(searchDirectory);
// get the size of all files in all subdirectories
var subDirSize = (from directory in directories select GetFileSizeSumFromDirectory(directory)).Sum();
return currentSize + subDirSize;
}
Or even nicer:
// get IEnumerable from all files in the current dir and all sub dirs
var files = Directory.EnumerateFiles(searchDirectory,"*",SearchOption.AllDirectories);
// get the size of all files
long sum = (from file in files let fileInfo = new FileInfo(file) select fileInfo .Length).Sum();
As Gabriel pointed out this will fail if you have a restricted directory under the searchDirectory!
You could hide your recursion behind an extension method (to avoid the issues Marc has highlighted with the GetFiles() method):
public static class UserExtension
{
public static IEnumerable<FileInfo> Walk(this DirectoryInfo directory)
{
foreach(FileInfo file in directory.GetFiles())
{
yield return file;
}
foreach(DirectoryInfo subDirectory in directory.GetDirectories())
{
foreach(FileInfo file in subDirectory.Walk())
{
yield return file;
}
}
}
}
(You probably want to add some exception handling to this for protected folders etc.)
Then:
using static UserExtension;
long totalSize = 0L;
var startFolder = new DirectoryInfo("<path to folder>");
// iteration
foreach(FileInfo file in startFolder.Walk())
{
totalSize += file.Length;
}
// linq
totalSize = di.Walk().Sum(s => s.Length);
Basically the same code, but maybe a little neater...
First, forgive my poor english ;o)
I had a problem that took me to this page : enumerate files of a directory and his subdirectories without blocking on an UnauthorizedAccessException, and, like the new method of .Net 4 DirectoryInfo.Enumerate..., get the first result before the end of the entire query.
With the help of various examples found here and there on the web, I finally write this method :
public static IEnumerable<FileInfo> EnumerateFiles_Recursive(this DirectoryInfo directory, string searchPattern, SearchOption searchOption, Func<DirectoryInfo, Exception, bool> handleExceptionAccess)
{
Queue<DirectoryInfo> subDirectories = new Queue<DirectoryInfo>();
IEnumerable<FileSystemInfo> entries = null;
// Try to get an enumerator on fileSystemInfos of directory
try
{
entries = directory.EnumerateFileSystemInfos(searchPattern, SearchOption.TopDirectoryOnly);
}
catch (Exception e)
{
// If there's a callback delegate and this delegate return true, we don't throw the exception
if (handleExceptionAccess == null || !handleExceptionAccess(directory, e))
throw;
// If the exception wasn't throw, we make entries reference an empty collection
entries = EmptyFileSystemInfos;
}
// Yield return file entries of the directory and enqueue the subdirectories
foreach (FileSystemInfo entrie in entries)
{
if (entrie is FileInfo)
yield return (FileInfo)entrie;
else if (entrie is DirectoryInfo)
subDirectories.Enqueue((DirectoryInfo)entrie);
}
// If recursive search, we make recursive call on the method to yield return entries of the subdirectories.
if (searchOption == SearchOption.AllDirectories)
{
DirectoryInfo subDir = null;
while (subDirectories.Count > 0)
{
subDir = subDirectories.Dequeue();
foreach (FileInfo file in subDir.EnumerateFiles_Recursive(searchPattern, searchOption, handleExceptionAccess))
{
yield return file;
}
}
}
else
subDirectories.Clear();
}
I use a Queue and a recursive method to keep traditional order (content of directory and then content of first subdirectory and his own subdirectories and then content of the second...). The parameter "handleExceptionAccess" is just a function call when an exception is thrown with a directory; the function must return true to indicate that the exception must be ignored.
With this methode, you can write :
DirectoryInfo dir = new DirectoryInfo("c:\\temp");
long size = dir.EnumerateFiles_Recursive("*", SearchOption.AllDirectories, (d, ex) => true).Sum(f => f.Length);
And here we are : all exception when trying to enumerate a directory will be ignore !
Hope this help
Lionel
PS : for a reason I can't explain, my method is more quick than the framework 4 one...
PPS : you can get my test solutions with source for those methods : here TestDirEnumerate. I write EnumerateFiles_Recursive, EnumerateFiles_NonRecursive (use a queue to avoid recursion) and EnumerateFiles_NonRecursive_TraditionalOrder (use a stack of queue to avoid recursion and keep traditional order). Keep those 3 methods has no interest, I write them only for test the best one. I think to keep only the last one.
I also wrote the equivalent for EnumerateFileSystemInfos and EnumerateDirectories.
Have a look at this post:
http://social.msdn.microsoft.com/forums/en-US/vbgeneral/thread/eed54ebe-facd-4305-b64b-9dbdc65df04e
Basically there is no clean .NET way, but there is a quite straightforward COM approach so if you're happy with using COM interop and being tied to Windows, this could work for you.
the solution is already here https://stackoverflow.com/a/12665904/1498669
as in the duplicate How do I Get Folder Size in C#? shown -> you can do this also in c#
first, add the COM reference "Microsoft Scripting Runtime" to your project and use:
var fso = new Scripting.FileSystemObject();
var folder = fso.GetFolder(#"C:\Windows");
double sizeInBytes = folder.Size;
// cleanup COM
System.Runtime.InteropServices.Marshal.ReleaseComObject(folder);
System.Runtime.InteropServices.Marshal.ReleaseComObject(fso);
remember to cleanup the COM references
I've been looking some time ago for a function like the one you ask for and from what I've found on the Internet and in MSDN forums, there is no such function.
The recursive way is the only I found to obtain the size of a Folder considering all the files and subfolders that contains.
You should make it easy on yourself. Make a method and passthrough the location of the directory.
private static long GetDirectorySize(string location) {
return new DirectoryInfo(location).GetFiles("*.*", SearchOption.AllDirectories).Sum(file => file.Length);
}
-G