C# Threading - Reading and hashing multiple files concurrently, easiest method?

C# Threading - Reading and hashing multiple files concurrently, easiest method? - c#

I've been trying to get what I believe to be the simplest possible form of threading to work in my application but I just can't do it.
What I want to do: I have a main form with a status strip and a progress bar on it. I have to read something between 3 and 99 files and add their hashes to a string[] which I want to add to a list of all files with their respective hashes. Afterwards I have to compare the items on that list to a database (which comes in text files).
Once all that is done, I have to update a textbox in the main form and the progressbar to 33%; mostly I just don't want the main form to freeze during processing.
The files I'm working with always sum up to 1.2GB (+/- a few MB), meaning I should be able to read them into byte[]s and process them from there (I have to calculate CRC32, MD5 and SHA1 of each of those files so that should be faster than reading all of them from a HDD 3 times).
Also I should note that some files may be 1MB while another one may be 1GB. I initially wanted to create 99 threads for 99 files but that seems not wise, I suppose it would be best to reuse threads of small files while bigger file threads are still running. But that sounds pretty complicated to me so I'm not sure if that's wise either.
So far I've tried workerThreads and backgroundWorkers but neither seem to work too well for me; at least the backgroundWorkers worked SOME of the time, but I can't even figure out why they won't the other times... either way the main form still froze.
Now I've read about the Task Parallel Library in .NET 4.0 but I thought I should better ask someone who knows what he's doing before wasting more time on this.
What I want to do looks something like this (without threading):
List<string[]> fileSpecifics = new List<string[]>();
int fileMaxNumber = 42; // something between 3 and 99, depending on file set
for (int i = 1; i <= fileMaxNumber; i++)
{
string fileName = "C:\\path\\to\\file" + i.ToString("D2") + ".ext"; // file01.ext - file99.ext
string fileSize = new FileInfo(fileName).Length.ToString();
byte[] file = File.ReadAllBytes(fileName);
// hash calculations (using SHA1CryptoServiceProvider() etc., no problems with that so I'll spare you that, return strings)
file = null; // I didn't yet check if this made any actual difference but I figured it couldn't hurt
fileSpecifics.Add(new string[] { fileName, fileSize, fileCRC, fileMD5, fileSHA1 });
}
// look for files in text database mentioned above, i.e. first check for "file bundles" with the same amount of files I have here; then compare file sizes, then hashes
// again, no problems with that so I'll spare you that; the database text files are pretty small so parsing them doesn't need to be done in an extra thread.
Would anybody be kind enough to point me in the right direction? I'm looking for the easiest way to read and hash those files quickly (I believe the hashing takes some time in which other files could already be read) and save the output to a string[], without the main form freezing, nothing more, nothing less.
I'm thankful for any input.
EDIT to clarify: by "backgroundWorkers working some of the time" I meant that (for the very same set of files), maybe the first and fourth execution of my code produces the correct output and the UI unfreezes within 5 seconds, for the second, third and fifth execution it freezes the form (and after 60 seconds I get an error message saying some thread didn't respond within that time frame) and I have to stop execution via VS.
Thanks for all your suggestions and pointers, as you all have correctly guessed I'm completely new to threading and will have to read up on the great links you guys posted.
Then I'll give those methods a try and flag the answer that helped me the most. Thanks again!

With .NET Framework 4.X
Use Directory.EnumerateFiles Method for efficient/lazy files enumeration
Use Parallel.For() to delegate parallelism work to PLINQ framework or use TPL to delegate single Task per pipeline Stage
Use Pipelines pattern to pipeline following stages: calculating hashcodes, compare with pattern, update UI
To avoid UI freeze use appropriate techniques: for WPF use Dispatcher.BeginInvoke(), for WinForms use Invoke(), see this SO answer
Considering that all this stuff has UI it might be useful adding some cancellation feature to abandon long running operation if needed, take a look at the CreateLinkedTokenSource class which allows triggering CancellationToken from the "external scope"
I can try adding an example but it's worth do it yourself so you would learn all this stuff rather than simply copy/paste - > got it working -> forgot about it.
PS: Must read - Pipelines paper at MSDN
TPL specific pipeline implementation
Pipeline pattern implementation: three stages: calculate hash, match, update UI
Three tasks, one per stage
Two Blocking Queues
//
// 1) CalculateHashesImpl() should store all calculated hashes here
// 2) CompareMatchesImpl() should read input hashes from this queue
// Tuple.Item1 - hash, Typle.Item2 - file path
var calculatedHashes = new BlockingCollection<Tuple<string, string>>();
// 1) CompareMatchesImpl() should store all pattern matching results here
// 2) SyncUiImpl() method should read from this collection and update
// UI with available results
var comparedMatches = new BlockingCollection<string>();
var factory = new TaskFactory(TaskCreationOptions.LongRunning,
TaskContinuationOptions.None);
var calculateHashesWorker = factory.StartNew(() => CalculateHashesImpl(...));
var comparedMatchesWorker = factory.StartNew(() => CompareMatchesImpl(...));
var syncUiWorker= factory.StartNew(() => SyncUiImpl(...));
Task.WaitAll(calculateHashesWorker, comparedMatchesWorker, syncUiWorker);
CalculateHashesImpl():
private void CalculateHashesImpl(string directoryPath)
{
foreach (var file in Directory.EnumerateFiles(directoryPath))
{
var hash = CalculateHashTODO(file);
calculatedHashes.Add(new Tuple<string, string>(hash, file.Path));
}
}
CompareMatchesImpl():
private void CompareMatchesImpl()
{
foreach (var hashEntry in calculatedHashes.GetConsumingEnumerable())
{
// TODO: obviously return type is up to you
string matchResult = GetMathResultTODO(hashEntry.Item1, hashEntry.Item2);
comparedMatches.Add(matchResult);
}
}
SyncUiImpl():
private void UpdateUiImpl()
{
foreach (var matchResult in comparedMatches.GetConsumingEnumerable())
{
// TODO: track progress in UI using UI framework specific features
// to do not freeze it
}
}
TODO: Consider using CancellationToken as a parameter for all GetConsumingEnumerable() calls so you easily can stop a pipeline execution when needed.

First off, you should be using a higher level of abstraction to solve this problem. You have a bunch of tasks to complete, so use the "task" abstraction. You should be using the Task Parallel Library to do this sort of thing. Let the TPL deal with the question of how many worker threads to create -- the answer could be as low as one if the work is gated on I/O.
If you do want to do your own threading, some good advice:
Do not ever block on the UI thread. That's is what is freezing your application. Come up with a protocol by which working threads can communicate with your UI thread, which then does nothing except for responding to UI events. Remember that methods of user interface controls like task completion bars must never be called by any other thread other than the UI thread.
Do not create 99 threads to read 99 files. That's like getting 99 pieces of mail and hiring 99 assistants to write responses: an extraordinarily expensive solution to a simple problem. If your work is CPU intensive then there is no point in "hiring" more threads than you have CPUs to service them. (That's like hiring 99 assistants in an office that only has four desks. The assistants spend most of their time waiting for a desk to sit at instead of reading your mail.) If your work is disk-intensive then most of those threads are going to be idle most of the time waiting for the disk, which is an even bigger waste of resources.

First, I hope you are using a built-in library for calculating hashes. It's possible to write your own, but it's far safer to use something that has been around for a while.
You may need only create as many threads as CPUs if your process is CPU intensive. If it is bound by I/O, you might be able to get away with more threads.
I do not recommend loading the entire file into memory. Your hashing library should support updating a chunk at a time. Read a chunk into memory, use it to update the hashes of each algorighm, read the next chunk, and repeat until end of file. The chunked approach will help lower your program's memory demands.
As others have suggested, look into the Task Parallel Library, particularly Data Parallelism. It might be as easy as this:
Parallel.ForEach(fileSpecifics, item => CalculateHashes(item));

Check out TPL Dataflow. You can use a throttled ActionBlock which will manage the hard part for you.

If my understanding that you are looking to perform some tasks in the background and not block your UI, then the UI BackgroundWorker would be an appropriate choice. You mentioned that you got it working some of the time, so my recommendation would be to take what you had in a semi-working state, and improve upon it by tracking down the failures. If my hunch is correct, your worker was throwing an exception, which it does not appear you are handling in your code. Unhandled exceptions that bubble out of their containing threads make bad things happen.

This code hashing one file (stream) using two tasks - one for reading, second for hashing, for more robust way you should read more chunks forward.
Because bandwidth of processor is much higher than of disk, unless you use some high speed Flash drive you gain nothing from hashing more files concurrently.
public void TransformStream(Stream a_stream, long a_length = -1)
{
Debug.Assert((a_length == -1 || a_length > 0));
if (a_stream.CanSeek)
{
if (a_length > -1)
{
if (a_stream.Position + a_length > a_stream.Length)
throw new IndexOutOfRangeException();
}
if (a_stream.Position >= a_stream.Length)
return;
}
System.Collections.Concurrent.ConcurrentQueue<byte[]> queue =
new System.Collections.Concurrent.ConcurrentQueue<byte[]>();
System.Threading.AutoResetEvent data_ready = new System.Threading.AutoResetEvent(false);
System.Threading.AutoResetEvent prepare_data = new System.Threading.AutoResetEvent(false);
Task reader = Task.Factory.StartNew(() =>
{
long total = 0;
for (; ; )
{
byte[] data = new byte[BUFFER_SIZE];
int readed = a_stream.Read(data, 0, data.Length);
if ((a_length == -1) && (readed != BUFFER_SIZE))
data = data.SubArray(0, readed);
else if ((a_length != -1) && (total + readed >= a_length))
data = data.SubArray(0, (int)(a_length - total));
total += data.Length;
queue.Enqueue(data);
data_ready.Set();
if (a_length == -1)
{
if (readed != BUFFER_SIZE)
break;
}
else if (a_length == total)
break;
else if (readed != BUFFER_SIZE)
throw new EndOfStreamException();
prepare_data.WaitOne();
}
});
Task hasher = Task.Factory.StartNew((obj) =>
{
IHash h = (IHash)obj;
long total = 0;
for (; ; )
{
data_ready.WaitOne();
byte[] data;
queue.TryDequeue(out data);
prepare_data.Set();
total += data.Length;
if ((a_length == -1) || (total < a_length))
{
h.TransformBytes(data, 0, data.Length);
}
else
{
int readed = data.Length;
readed = readed - (int)(total - a_length);
h.TransformBytes(data, 0, data.Length);
}
if (a_length == -1)
{
if (data.Length != BUFFER_SIZE)
break;
}
else if (a_length == total)
break;
else if (data.Length != BUFFER_SIZE)
throw new EndOfStreamException();
}
}, this);
reader.Wait();
hasher.Wait();
}
Rest of code here: http://hashlib.codeplex.com/SourceControl/changeset/view/71730#514336

Related

Multiple users writing at the same file

I have a project which is a Web API project, my project is accessed by multiple users (i mean a really-really lot of users). When my project being accessed from frontend (web page using HTML 5), and user doing something like updating or retrieving data, the backend app (web API) will write a single log file (a .log file but the content is JSON).
The problem is, when being accessed by multiple users, the frontend became unresponsive (always loading). The problem is in writing process of the log file (single log file being accessed by a really-really lot of users). I heard that using a multi threading technique can solve the problem, but i don't know which method. So, maybe anyone can help me please.
Here is my code (sorry if typo, i use my smartphone and mobile version of stack overflow):
public static void JsonInputLogging<T>(T m, string methodName)
{
MemoryStream ms = new MemoryStream();
DataContractJsonSerializer ser = new
DataContractJsonSerializer(typeof(T));
ser.WriteObject(ms, m);
string jsonString = Encoding.UTF8.GetString(ms.ToArray());
ms.Close();
logging("MethodName: " + methodName + Environment.NewLine + jsonString.ToString());
}
public static void logging (string message)
{
string pathLogFile = "D:\jsoninput.log";
FileInfo jsonInputFile = new FileInfo(pathLogFile);
if (File.Exists(jsonInputFile.ToString()))
{
long fileLength = jsonInputFile.Length;
if (fileLength > 1000000)
{
File.Move(pathLogFile, pathLogFile.Replace(*some new path*);
}
}
File.AppendAllText(pathLogFile, *some text*);
}

You have to understand some internals here first. For each [x] users, ASP.Net will use a single worker process. One worker process holds multiple threads. If you're using multiple instances on the cloud, it's even worse because then you also have multiple server instances (I assume this ain't the case).
A few problems here:
You have multiple users and therefore multiple threads.
Multiple threads can deadlock each other writing the files.
You have multiple appdomains and therefore multiple processes.
Multiple processes can lock out each other
Opening and locking files
File.Open has a few flags for locking. You can basically lock files exclusively per process, which is a good idea in this case. A two-step approach with Exists and Open won't help, because in between another worker process might do something. Bascially the idea is to call Open with write-exclusive access and if it fails, try again with another filename.
This basically solves the issue with multiple processes.
Writing from multiple threads
File access is single threaded. Instead of writing your stuff to a file, you might want to use a separate thread to do the file access, and multiple threads that tell the thing to write.
If you have more log requests than you can handle, you're in the wrong zone either way. In that case, the best way to handle it for logging IMO is to simply drop the data. In other words, make the logger somewhat lossy to make life better for your users. You can use the queue for that as well.
I usually use a ConcurrentQueue for this and a separate thread that works away all the logged data.
This is basically how to do this:
// Starts the worker thread that gets rid of the queue:
internal void Start()
{
loggingWorker = new Thread(LogHandler)
{
Name = "Logging worker thread",
IsBackground = true,
Priority = ThreadPriority.BelowNormal
};
loggingWorker.Start();
}
We also need something to do the actual work and some variables that are shared:
private Thread loggingWorker = null;
private int loggingWorkerState = 0;
private ManualResetEventSlim waiter = new ManualResetEventSlim();
private ConcurrentQueue<Tuple<LogMessageHandler, string>> queue =
new ConcurrentQueue<Tuple<LogMessageHandler, string>>();
private void LogHandler(object o)
{
Interlocked.Exchange(ref loggingWorkerState, 1);
while (Interlocked.CompareExchange(ref loggingWorkerState, 1, 1) == 1)
{
waiter.Wait(TimeSpan.FromSeconds(10.0));
waiter.Reset();
Tuple<LogMessageHandler, string> item;
while (queue.TryDequeue(out item))
{
writeToFile(item.Item1, item.Item2);
}
}
}
Basically this code enables you to work away all the items from a single thread using a queue that's shared across threads. Note that ConcurrentQueue doesn't use locks for TryDequeue, so clients won't feel any pain because of this.
Last thing that's needed is to add stuff to the queue. That's the easy part:
public void Add(LogMessageHandler l, string msg)
{
if (queue.Count < MaxLogQueueSize)
{
queue.Enqueue(new Tuple<LogMessageHandler, string>(l, msg));
waiter.Set();
}
}
This code will be called from multiple threads. It's not 100% correct because Count and Enqueue don't necessarily have to be called in a consistent way - but for our intents and purposes it's good enough. It also doesn't lock in the Enqueue and the waiter will ensure that the stuff is removed by the other thread.
Wrap all this in a singleton pattern, add some more logic to it, and your problem should be solved.

That can be problematic, since every client request handled by new thread by default anyway. You need some "root" object that is known across the project (don't think you can achieve this in static class), so you can lock on it before you access the log file. However, note that it will basically serialize the requests, and probably will have a very bad effect on performance.

No multi-threading does not solve your problem. How are multiple threads supposed to write to the same file at the same time? You would need to care about data consistency and I don't think that's the actual problem here.
What you search is asynchronous programming. The reason your GUI becomes unresponsive is, that it waits for the tasks to complete. If you know, the logger is your bottleneck then use async to your advantage. Fire the log method and forget about the outcome, just write the file.
Actually I don't really think your logger is the problem. Are you sure there is no other logic which blocks you?

CPU-greedy loop when streaming music

To give some context, I'm working on an opensource alternative desktop Spotify client, with accessibility at it's core. You'll also see some NAudio in here.
I'm noticing pretty intense CPU usage as soon as playback starts. Even when paused, the CPU is high.
I ran Visual Studio's inbuilt profiler to try and shed some light on any resource hogs that might be occuring. As I suspected, the problem wasin my playback manager's streaming loop.
The code that the profiler flags as one of the most sample-rich is as follows:
const int secondsToBuffer = 3;
private void GetStreaming(object state)
{
this.fullyDownloaded = false;
// secondsToBuffer is an integer to represent how many seconds we should buffer up at once to prevent choppy playback on slow connections
try
{
do
{
if (bufferedWaveProvider == null)
{
this.bufferedWaveProvider = new BufferedWaveProvider(new WaveFormat(44100, 2));
this.bufferedWaveProvider.BufferDuration = TimeSpan.FromSeconds(20); // allow us to get well ahead of ourselves
Logger.WriteDebug("Creating buffered wave provider");
this.gatekeeper.MinimumSampleSize = bufferedWaveProvider.WaveFormat.AverageBytesPerSecond * secondsToBuffer;
}
// this bit in particular seems to be the hot point
if (bufferedWaveProvider != null && bufferedWaveProvider.BufferLength - bufferedWaveProvider.BufferedBytes < bufferedWaveProvider.WaveFormat.AverageBytesPerSecond / 4)
{
Logger.WriteDebug("Buffer getting full, taking a break");
Thread.Sleep(500);
}
// do we have at least double the buffered sample's size in free space, just in case
else if (bufferedWaveProvider.BufferLength - bufferedWaveProvider.BufferedBytes > bufferedWaveProvider.WaveFormat.AverageBytesPerSecond * (secondsToBuffer * 2))
{
var sample = gatekeeper.Read();
if (sample != null)
{
bufferedWaveProvider.AddSamples(sample, 0, sample.Length);
}
}
} while (playbackState != StreamingPlaybackState.Stopped);
Logger.WriteDebug("Playback stopped");
}
finally
{
// no post-processing work here, right?
}
}
An NAudio sample was the inspiration for my way of handling streaming in this method. To find the full file's source code, you can view it here: http://blindspot.codeplex.com/SourceControl/latest#Blindspot.Playback/PlaybackManager.cs
I'm a newbie to profiling and I'm not a year on year expert on streaming either (both might be obvious).
Is there any way I can make this loop less resource intensive. Would increasing the sleep amount in the if block where the buffer is full help? Or am I barking up the wrong tree here. It seems like it would, but I'd have thought half a second would be sufficient.
Any help gratefully received.

Basically, you've created an infinite loop until the buffer gets full. The section you've marked with
// this bit in particular seems to be the hot point
probably appears to be as the calculations in the if statement are just being repeated over and over again; can any of them be moved outside of the loop?
I'd put a Thread.Sleep(50) before the while statement to prevent thrashing and see if that makes a difference (I suspect it will).

Out of Memory Exception when using File Stream Write Byte to Output Progress Through the Console

I have the following code that throws an out of memory exception when writing large files. Is there something I'm missing?
I am not sure why it is throwing an out of memory error as I thought the Filestream would only use a maximum of 4096 bytes for the buffer? I am not entirely sure what it means by the Buffer to be honest and any advice would be appreciated.
public static async Task CreateRandomFile(string pathway, int size, IProgress<int> prog)
{
byte[] fileSize = new byte[size];
new Random().NextBytes(fileSize);
await Task.Run(() =>
{
using (FileStream fs = File.Create(pathway,4096))
{
for (int i = 0; i < size; i++)
{
fs.WriteByte(fileSize[i]);
prog.Report(i);
}
}
}
);
}
public static void p_ProgressChanged(object sender, int e)
{
int pos = Console.CursorTop;
Console.WriteLine("Progress Copied: " + e);
Console.SetCursorPosition (0, pos);
}
public static void Main()
{
Console.WriteLine("Testing CopyLearning");
//CopyFile()
Progress<int> p = new Progress<int>();
p.ProgressChanged += p_ProgressChanged;
Task ta = CreateRandomFile(#"D:\Programming\Testing\RandomFile.asd", 99999999, p);
ta.Wait();
}
Edit: the 99,999,999 was just created to make a 99MB file
Note: I have commented out prog.Report(i) and it will work fine.
It seems for some reason, the error occurs at the line
Console.writeline("Progress Copied: " + e);
I am not entirely sure why this causes an error? So the error might have been caused because of the progressEvent?
Edit 2: I have followed advice to change the code such that it reports progress every 4000 Bytes by using the following:
if (i%4000==0)
prog.Report(i);
For some reason. I am now able to write files up to 900MBs fine.
I guess the question is, why would the "Edit 2"'s code allow it to write up to 900MB just fine? Is it because it's reporting progress and writing to the console up to 4000x less than before? I didn't realize the Console would take up so much memory especially because I'm assuming all it's doing is outputting "Progress Copied"?
Edit 3:
For some reason when I change the following line as follows:
for (int i = 0; i < size; i++)
{
fs.WriteByte(fileSize[i]);
Console.Writeline(i)
prog.Report(i);
}
where there is a "Console.Writeline()" before the prog.Report(i), it would work fine and copy the file, albeit take a very long time to do so. This leads me to believe that this is a Console related issue for some reason but I am not sure as to what.

fs.WriteByte(fileSize[i]);
prog.Report(i);
You created a fire-hose problem. After deadlocks and threading races, probably the 3rd most likely problem caused by threads. And just as hard to diagnose.
Easiest to see by using the debugger's Debug + Windows + Threads window and look at thread that is executing CreateRandomFile(). With some luck, you'll see it is completed and has written all 99MB bytes. But the progress reported on the console is far behind this, having only reported 125KB bytes written, give or take.
Core issue is the way Progress<>.Report() works. It uses SynchronizationContext.Post() to invoke the ProgressChanged event handler. In a console mode app that will call ThreadPool.QueueUserWorkItem(). That's quite fast, your CreateRandomFile() method won't be bogged down much by it.
But the event handler itself is quite a lot slower, console output is not very fast. So in effect, you are adding threadpool work requests at an enormous rate, 99 million of them in a handful of seconds. No way for the threadpool scheduler to keep up, you'll have roughly 4 of them executing at the same time. All competing to write to the console as well, only one of them can acquire the underlying lock.
So it is the threadpool scheduler that causes OOM, forced to store so many work requests.
And sure, when you call Report() less frequently then the fire-hose problem is a lot less worse. Not actually that simple to ensure it never causes a problem, although directly calling Console.Write() is an obvious fix. Ultimately simple, create a usable UI that is useful to a human. Nobody likes a crazily scrolling window or a blur of text. Reporting progress no more frequently than 20 times per second is plenty good enough for the user's eyes, the console has no trouble keeping up with that.

Parallel GZip Decompression of Log Files - Tweaking MaxDegreeOfParallelism for the Highest Throughput

We have up to 30 GB of GZipped log files per day. Each file holds 100.000 lines and is between 6 and 8 MB when compressed. The simplified code in which the parsing logic has been stripped out, utilises the Parallel.ForEach loop.
The maximum number of lines processed peaks at MaxDegreeOfParallelism of 8 on the two-NUMA node, 32 logical CPU box (Intel Xeon E7-2820 # 2 GHz):
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;
namespace ParallelLineCount
{
public class ScriptMain
{
static void Main(String[] args)
{
int maxMaxDOP = (args.Length > 0) ? Convert.ToInt16(args[0]) : 2;
string fileLocation = (args.Length > 1) ? args[1] : "C:\\Temp\\SomeFiles" ;
string filePattern = (args.Length > 1) ? args[2] : "*2012-10-30.*.gz";
string fileNamePrefix = (args.Length > 1) ? args[3] : "LineCounts";
Console.WriteLine("Start: {0}", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
Console.WriteLine("Processing file(s): {0}", filePattern);
Console.WriteLine("Max MaxDOP to be used: {0}", maxMaxDOP.ToString());
Console.WriteLine("");
Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");
for (int maxDOP = 1; maxDOP <= maxMaxDOP; maxDOP++)
{
// Construct ConcurrentStacks for resulting strings and counters
ConcurrentStack<Int64> TotalLines = new ConcurrentStack<Int64>();
ConcurrentStack<Int64> TotalSomeBookLines = new ConcurrentStack<Int64>();
ConcurrentStack<Int64> TotalLength = new ConcurrentStack<Int64>();
ConcurrentStack<int> TotalFiles = new ConcurrentStack<int>();
DateTime FullStartTime = DateTime.Now;
string[] files = System.IO.Directory.GetFiles(fileLocation, filePattern);
var options = new ParallelOptions() { MaxDegreeOfParallelism = maxDOP };
// Method signature: Parallel.ForEach(IEnumerable<TSource> source, Action<TSource> body)
Parallel.ForEach(files, options, currentFile =>
{
string filename = System.IO.Path.GetFileName(currentFile);
DateTime fileStartTime = DateTime.Now;
using (FileStream inFile = File.Open(fileLocation + "\\" + filename, FileMode.Open))
{
Int64 lines = 0, someBookLines = 0, length = 0;
String line = "";
using (var reader = new StreamReader(new GZipStream(inFile, CompressionMode.Decompress)))
{
while (!reader.EndOfStream)
{
line = reader.ReadLine();
lines++; // total lines
length += line.Length; // total line length
if (line.Contains("book")) someBookLines++; // some special lines that need to be parsed later
}
TotalLines.Push(lines); TotalSomeBookLines.Push(someBookLines); TotalLength.Push(length);
TotalFiles.Push(1); // silly way to count processed files :)
}
}
}
);
TimeSpan runningTime = DateTime.Now - FullStartTime;
// Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");
Console.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7}",
maxDOP.ToString(),
TotalFiles.Sum().ToString(),
Convert.ToInt32(runningTime.TotalMilliseconds).ToString(),
TotalLength.Sum().ToString(),
TotalLines.Sum(),
TotalSomeBookLines.Sum().ToString(),
Convert.ToInt64(TotalLines.Sum() / runningTime.TotalMilliseconds).ToString(),
Convert.ToInt64(TotalLength.Sum() / runningTime.TotalMilliseconds).ToString());
}
Console.WriteLine();
Console.WriteLine("Finish: " + DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
}
}
}
Here's a summary of the results, with a clear peak at MaxDegreeOfParallelism = 8:
The CPU load (shown aggregated here, most of the load was on a single NUMA node, even when DOP was in 20 to 30 range):
The only way I've found to make CPU load cross 95% mark was to split the files across 4 different folders and execute the same command 4 times, each one targeting a subset of all files.
Can someone find a bottleneck?

It's likely that one problem is the small buffer size used by the default FileStream constructor. I suggest you use a larger input buffer. Such as:
using (FileStream infile = new FileStream(
name, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
The default buffer size is 4 kilobytes, which has the thread making many calls to the I/O subsystem to fill its buffer. A buffer of 64K means that you will make those calls much less frequently.
I've found that a buffer size of between 32K and 256K gives the best performance, with 64K being the "sweet spot" when I did some detailed testing a while back. A buffer size larger than 256K actually begins to reduce performance.
Also, although this is unlikely to have a major effect on performance, you probably should replace those ConcurrentStack instances with 64-bit integers and use Interlocked.Add or Interlocked.Increment to update them. It simplifies your code and removes the need to manage the collections.
Update:
Re-reading your problem description, I was struck by this statement:
The only way I've found to make CPU load cross 95% mark was to split
the files across 4 different folders and execute the same command 4
times, each one targeting a subset of all files.
That, to me, points to a bottleneck in opening files. As though the OS is using a mutual exclusion lock on the directory. And even if all the data is in the cache and there's no physical I/O required, processes still have to wait on this lock. It's also possible that the file system is writing to the disk. Remember, it has to update the Last Access Time for a file whenever it's opened.
If I/O really is the bottleneck, then you might consider having a single thread that does nothing but load files and stuff them into a BlockingCollection or similar data structure so that the processing threads don't have to contend with each other for a lock on the directory. Your application becomes a producer/consumer application with one producer and N consumers.

The reason for this is usually that threads synchronize too much.
Looking for synchronization in your code I can see heavy syncing on the collections. Your threads are pushing the lines individually. This means that each line incurs at best an interlocked operation and at worst a kernel-mode lock wait. The interlocked operations will contend heavily because all threads race to get their current line into the collection. They all try to update the same memory locations. This causes cache line pinging.
Change this to push lines in bigger chunks. Push line-arrays of 100 lines or more. The more the better.
In other words, collect results in a thread-local collection first and only rarely merge into the global results.
You might even want to get rid of the manual data pushing altogether. This is what PLINQ is made for: Streaming data concurrently. PLINQ abstracts away all the concurrent collection manipulations in a well-performing way.

I don't think Parallelizing the disk reads is helping you. In fact, this could be seriously impacting your performance by creating contention in reading from multiple areas of storage at same time.
I would restructure the program to first do a single-threaded read of raw file data into a memory stream of byte[]. Then, do a Parallel.ForEach() on each stream or buffer to decompress and count the lines.
You take an initial IO read hit up front but let the OS/hardware optimize the hopefully mostly sequential reads, then decompress and parse in memory.
Keep in mind that operations like decomprless, Encoding.UTF8.ToString(), String.Split(), etc. will use large amounts of memory, so clean up references to/dispose of old buffers as you no longer need them.
I'd be surprised if you can't cause the machine to generate some serious waste hit this way.
Hope this helps.

The problem, I think, is that you are using blocking I/O, so your threads cannot fully take advantage of parallelism.
If I understand your algorithm right (sorry, I'm more of a C++ guy) this is what you are doing in each thread (pseudo-code):
while (there is data in the file)
read data
gunzip data
Instead, a better approach would be something like this:
N = 0
read data block N
while (there is data in the file)
asyncRead data block N+1
gunzip data block N
N = N + 1
gunzip data block N
The asyncRead call does not block, so basically you have the decoding of block N happening concurrently with the reading of block N+1, so by the time you are done decoding block N you might have block N+1 ready (or close to be ready if I/O is slower than decoding).
Then it's just a matter of finding the block size that gives you the best throughput.
Good luck.

Editing File in a thread

I'm working on a xml service at the moment , which is a sum of 20+ other xml's from other site's services.
So at first it was just ;
GetherDataAndCreateXML();
But obviously getting 20+ other xml , editing and serving it takes time , so i decided to cache it for like 10 minutes and added a final.xml file with a DateTime attribute to check if it's out of date etc. So it became something like ;
var de = DateTime.Parse(x.Element("root").Attribute("DateTime").Value).AddSeconds(10.0d);
if (de >= DateTime.Now)
return finalXML();
else
{
RefreshFinalXml();
return finalXML();
}
The problem now , is that any request after that 10 minute obviously takes too much time as it's waiting for my looong RefreshFinalXml() function. So i did this;
if (ndt >= DateTime.Now)
return finalXML();
else
{
ThreadStart start = RefreshFinalXml;
var thr = new Thread(start);
thr.IsBackground = true;
thr.Start();
return finalXML();
}
This way , even at the 11th minute i simply return the old final.xml but meanwhile i start another thread to refresh current xml at the background. So after something like 13th minute , users get fresh data without any delay.
But still there is a problem with this ; it creates a new thread for every single request between 10 to 13th minutes ( while first RefreshFinalXml is still working at the background ) and obviously i can't let that happen , right? And since I don't know much about locking files and detecting if it's lock , i added a little attribute , "Updating" to my final xml ;
if (ndt >= DateTime.Now)
return finalXML();
else
{
if (final.Element("root").Attribute("Updating").Value != "True")
{
final.Element("root").SetAttributeValue("Updating", "True");
final.Save(Path);
ThreadStart start = RefreshFinalXml;
//I change Updating Attribute back to False at the end of this function , right before saving Final Xml
var thr = new Thread(start);
thr.IsBackground = true;
thr.Start();
}
return finalXML();
}
So ,
0-10 minutes = return from cache
10~13 minutes = return from cache while just one thread is refreshing final.xml
13+ minutes = returns from cache
It works and seems decent at the moment , but the question/problem is ; I'm extremely inexperienced in these kind of stuff ( xml services , threading , locks etc ) so i'm not really sure if it'll work flawlessly under tougher situations. For example , will my custom locking create problems under heavy traffic, should i switch to lock file etc.
So I'm looking for any advice/correction about this process , what would be the "best practice" etc.
Thanks in advance
Full Code : http://pastebin.com/UH94S8t6
Also apologies for my English as it's not my mother language and it gets even worse when I'm extremely sleepless/tired as I'm at the moment.
EDIT : Oh I'm really sorry but somehow i forgot to mention a crucial thing ; this is all working on Asp.Net Mvc2. I think i could have done a little better if it wasn't a web application but i think that changes many things right?

You've got a couple of options here.
Approach #1
First, you can use .NET's asychronous APIs for fetching the data. Assuming you're using HttpWebRequest you'd want to take a look at BeginGetResponse and EndGetResponse, as well as the BeginRead and EndRead methods on the Stream you get back the response.
Example
var request = WebRequest.Create("http://someurl.com");
request.BeginGetResponse(delegate (IAsyncResult ar)
{
Stream responseStream = request.EndGetResponse(ar).GetResponseStream();
// use async methods on the stream to process the data -- omitted for brevity
});
Approach #2
Another approach is to use the thread pool to do your work, rather than creating and managing your own threads. This will effectively cap the number of threads you're running, as well as removing the performance hit you'd normally get when you create a new thread.
Now, you're right about not wanting to repeatedly fire updates while you wait for
Example #2
Your code might look something like this:
// We use a dictionary here for efficiency
var Updating = new Dictionary()<TheXMLObjectType, object>;
...
if (de >= DateTime.Now)
{
return finalXML();
}
else
{
// Lock the updating dictionary to prevent other threads from
// updating it before we're done.
lock (Updating)
{
// If the xml is already in the updating dictionary, it's being
// updated elsewhere, so we don't need to do anything.
// On the other hand, if it's not already being updated we need
// to queue RefreshFinalXml, and set the updating flag
if (!Updating.ContainsKey(xml))
{
// Use the thread pool for the work, rather than managing our own
ThreadPool.QueueUserWorkItem(delegate (Object o)
{
RefreshFinalXml();
lock(Updating)
{
Updating.Remove(xml);
}
});
// Set the xml in the updating dictionary
Updating[xml] = null;
}
}
return finalXML();
}
Hopefully that's enough for you to work off of.

I would go for a different method assuming the following
Your service is always running
You can afford/are allowed to getting the XML files even if you don't have any request to your service currently.
The XML files you fetch are the same files for all your requests. (that is the total number of XML files you need for all your responses are those 20 files)
The resulting XML file is not too big to keep in memory all the time
1
First of all I would not store the resulting XML in a file on disk but rather in a static variable.
2
Second I would create a timer set on 10 minutes that updates the cache even if you have no calls to your service. That way you always have quite recent data ready and cached even if your service was not called for a while. It also removes the need to think about if you already have a refresh "ongoing".
3
Third I would consider using threading/async calls to fetch all your 20 XML's in parallel. This is only useful if you want to reduce the refresh time. It could allow you to reduce the refresh interval from 10 to maybe 1-2 minutes if that is improving your service.
I would recommend 1 and 2, but 3 is more optional.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.