How to speed up large XML file read/write operations - C#

I have a Windows Forms application that currently does the following:
1) points to a directory and performs step 2) for all the XML files in there (usually a maximum of 25 files, ranging from 10 MB up to 5 GB - uncommon but possible)
2) reads/writes the XML to alter some of the existing attributes (currently I use a single BackgroundWorker for that)
3) writes the altered XML directly to a NEW file in a different directory
The little app works fine, but it takes far too long to finish (about 20 min depending on the total GB size).
What I casually tried was starting the main read/write method inside a Parallel.ForEach(), but it unsurprisingly blocked itself and exited.
My idea is to parallelize the read/write process by starting it on all ~25 files at the same time. Is this wise? How can I do it (TPL?) without locking myself out?
PS: I have quite a powerful desktop PC with a 1 TB Samsung Pro SSD, 16 GB of RAM, and an Intel Core i7.

You could use a thread pool for this approach, sized for around 20 files.
Since you have a Core i7, TaskFactory.StartNew is a reasonable fit.
In that case, encapsulate the code for processing one file in a small class such as XMLProcessor, then use TaskFactory.StartNew to run the XML processing on multiple threads.
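A minimal sketch of that idea, assuming a hypothetical XMLProcessor class whose Process() method holds the per-file read/alter/write logic (the class and method names are illustrative, not from the original question):

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical per-file worker; the actual read/alter/write logic goes in Process().
public class XMLProcessor
{
    public void Process(string inputFile, string outputDirectory)
    {
        // TODO: load the XML, alter the attributes, write the result to outputDirectory
    }
}

public static class Runner
{
    public static void ProcessAll(string inputDirectory, string outputDirectory)
    {
        // Start one task per file and wait for all of them to finish.
        Task[] tasks = Directory.GetFiles(inputDirectory, "*.xml")
            .Select(file => Task.Factory.StartNew(
                () => new XMLProcessor().Process(file, outputDirectory)))
            .ToArray();

        Task.WaitAll(tasks);
    }
}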

This sounds like a job for data parallelism via PLINQ + asynchronous lambdas.
I recently needed to process data from a zip archive that itself contained 5,200 zip archives which then each contained one or more data files in XML or CSV format. In total, between 40-60 GB of data when decompressed and read into memory.
The algorithm browses through this data, makes decisions based on what it finds in conjunction with supplied predicates, and finally writes the selections to disk as 1.0-1.5 GB files. Using an async PLINQ pattern with 32 processors, the average run time for each output file was 4.23 minutes.
After implementing the straightforward solution with async PLINQ, I spent some time trying to improve the running time by digging down into the TPL and TPL Dataflow libraries. In the end, attempting to beat async PLINQ proved to be a fun but ultimately fruitless exercise for my needs. The performance margins from the more "optimized" solutions were not worth the added complexity.
Below is an example of the async PLINQ pattern. The initial collection is an array of file paths.
In the first step, each file is asynchronously read into memory and parsed, the file name is cached as a root-level attribute, and the resulting element is streamed to the next function.
In the last step, each XElement is asynchronously written to a new file.
I recommend that you play around with the lambda that reads the files. In my case, I found that reading via an async lambda gave me better throughput while decompressing files in memory.
However, for simple XML documents, you may be better off replacing the first async lambda with a method call to XElement.Load(string file) and letting PLINQ read as needed.
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

namespace AsyncPlinqExample
{
    public class Program
    {
        public static void Main(string[] args)
        {
            // Limit parallelism here if needed
            int degreeOfParallelism = Environment.ProcessorCount;

            string resultDirectory = "[result directory path here]";

            string[] files = Directory.GetFiles("[directory with files here]");

            files.AsParallel()
                .WithDegreeOfParallelism(degreeOfParallelism)
                .Select(
                    async x =>
                    {
                        using (StreamReader reader = new StreamReader(x))
                        {
                            XElement root = XElement.Parse(await reader.ReadToEndAsync());
                            root.SetAttributeValue("fileName", Path.GetFileName(x));
                            return root;
                        }
                    })
                .Select(x => x.Result)
                .Select(
                    x =>
                    {
                        // Perform other manipulations here
                        return x;
                    })
                .Select(
                    async x =>
                    {
                        string resultPath =
                            Path.Combine(
                                resultDirectory,
                                (string) x.Attribute("fileName"));

                        await Console.Out.WriteLineAsync($"{DateTime.Now}: Starting {(string) x.Attribute("fileName")}.");

                        using (StreamWriter writer = new StreamWriter(resultPath))
                        {
                            await writer.WriteAsync(x.ToString());
                        }

                        await Console.Out.WriteLineAsync($"{DateTime.Now}: Completed {(string) x.Attribute("fileName")}.");
                    })
                .ForAll(t => t.Wait()); // Force the lazy query to execute and wait for the writes to finish.
        }
    }
}
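As suggested above, for simple XML documents the first async lambda can be swapped for a plain XElement.Load call and PLINQ left to schedule the reads. A minimal sketch of that variant, reusing the files, degreeOfParallelism and resultDirectory variables from the program above (an illustrative fragment, not part of the original answer):

files.AsParallel()
    .WithDegreeOfParallelism(degreeOfParallelism)
    .Select(x =>
    {
        // Let PLINQ read and parse the file on its worker thread.
        XElement root = XElement.Load(x);
        root.SetAttributeValue("fileName", Path.GetFileName(x));
        return root;
    })
    .ForAll(x => x.Save(Path.Combine(resultDirectory, (string)x.Attribute("fileName"))));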

Related

IronPDF deadlocks when running in parallel

I'm trying to generate multiple PDFs in parallel using IronPDF's HTML-to-PDF feature, but it appears to deadlock when started from ASP.NET :(
I've recreated the problem here: https://github.com/snebjorn/ironpdf-threading-issue-aspnet
Here's a snippet with the essential parts.
Calling GetSequential() works, but it is not executing in parallel.
GetSimple() is running in parallel but deadlocks.
public class TestController : Controller
{
    [HttpGet]
    [Route("simple")]
    public async Task<IActionResult> GetSimple()
    {
        var tasks = Enumerable
            .Range(1, 10)
            .Select(i => HtmlToDocumentAsync("hello", i));
        var pdfs = await Task.WhenAll(tasks);

        using var pdf = PdfDocument.Merge(pdfs);
        pdf.SaveAs("output.pdf");

        return Ok();
    }

    [HttpGet]
    [Route("seq")]
    public async Task<IActionResult> GetSequential()
    {
        var pdfs = new List<PdfDocument>();
        foreach (var i in Enumerable.Range(1, 10))
        {
            pdfs.Add(await HtmlToDocumentAsync("hello", i));
        }

        using var pdf = PdfDocument.Merge(pdfs);
        pdf.SaveAs("output.pdf");

        return Ok();
    }

    private async Task<PdfDocument> HtmlToDocumentAsync(string html, int i)
    {
        using var renderer = new HtmlToPdf();
        var pdf = await renderer.RenderHtmlAsPdfAsync(html);
        return pdf;
    }
}
According to https://medium.com/rubrikkgroup/understanding-async-avoiding-deadlocks-e41f8f2c6f5d it's because the thread executing the controller method isn't a main thread. So it just gets added to the thread pool and at some point we're waiting for the controller thread to continue but it's not getting scheduled back in. This happens when we mix async/await with .Wait/.Result.
So am I right to assume that there are .Wait/.Result calls happening inside the IronPDF.Threading package?
Is there a workaround?
UPDATE:
I updated to IronPdf 2021.9.3737 and it now appears to work 🎉
Also updated https://github.com/snebjorn/ironpdf-threading-issue-aspnet
Just wanted to add to this that IronPdf's multi-threading support on MVC web apps is non-existent. You will end up with indefinite deadlocks if you're rendering in the context of an HTTP request.
We have been strung along with the promise of an updated renderer that would resolve the issue (which we were told should be released June/July 2021) but that appears to have been pushed back. I tested the updated renderer using their 'early-access package' and the deadlocking has been replaced by 10 second thread blocks and seemingly random C++ exceptions, so it's far from being fixed. Performance is better single-threaded.
Darren's reply is incorrect - I've stepped through our render calls countless times trying to fix this, and the deadlock comes on the HtmlToPdf.StaticRenderHtmlAsPdf call, not on a PdfDocument.Merge call. It's a threading issue.
I suggest avoiding IronPdf if you haven't already bought their product. Find another solution.
I used IronPDF on the 2021.9.3737 branch without any threading issues on Windows and Linux today, thanks to Darren's help in another thread.
Documentation: https://ironpdf.com/object-reference/api/
Nuget: https://www.nuget.org/packages/IronPdf/
Source of information: C# PDF Generation (using IronPDF on Azure)
I agree with abagonhishead that StaticRenderHtmlAsPdf used to create a queue of PDF documents to be rendered, and on an under-provisioned server it ended in a thread deadlock... the queue getting longer and longer as the server struggled to render PDFs.
Solution that worked for me:
moving to a well provisioned server (Azure B1 for example)
(and/or) moving to IronPDF latest Nuget 2021.9.3737
Iron Software support here.
Our engineers tested your sample project, increasing to 150 iterations, and saw it running without issue.
Our understanding of your use case is that you are creating multiple threads to generate PDF files and storing them in an array for merging later?
Assuming this is the case, the likely cause of this issue is sending too large an array to the merge method, which requires a large amount of RAM to process. The crash happens when memory cannot cope with the large number of PDFs to merge.
As you can see from the attached image, I tested your code with 1000 iterations and it works without issue. I believe the problem may occur when you increase the iterations or feed in large HTML input that pushes the CPU and memory beyond what the machine can handle.
Also, I don't agree with abagonhishead, because there is no alternative solution on the market that offers all these features.

Why does CPU usage not increase?

Given the following code:
class Methods
{
    public MemoryStream UniqPicture(string imagePath)
    {
        var photoBytes = File.ReadAllBytes(imagePath); // change imagePath with a valid image path
        var quality = 70;
        var format = ImageFormat.Jpeg; // we're going to convert a jpeg image to a png one
        var size = new Size(200, 200);

        using (var inStream = new MemoryStream(photoBytes))
        {
            using (var outStream = new MemoryStream())
            {
                using (var imageFactory = new ImageFactory())
                {
                    imageFactory.Load(inStream)
                        .Rotate(new Random().Next(-7, 7))
                        .RoundedCorners(new RoundedCornerLayer(190))
                        .Pixelate(3, null)
                        .Contrast(new Random().Next(-15, 15))
                        .Brightness(new Random().Next(-15, 15))
                        .Quality(quality)
                        .Save(outStream);
                }
                return outStream;
            }
        }
    }

    public void StartUniq()
    {
        var files = Directory.GetFiles("mypath");
        Parallel.ForEach(files, (picture) => { UniqPicture(picture); });
    }
}
When I start the StartUniq() method, my CPU is bound at 12-13% and no more. Can I use more CPU % for this operation? Why does it not increase?
I tried doing it from Python as well, and it's also only 12-13%. It's a Core i7 8700.
The only way to do the operation faster is to start a second window of the application.
Is it a Windows limit? I'm using Windows Server 2016.
I think this is a system limit, because even this simple code is bound to 12% CPU too!
while (true)
{
    var a = 1 + 2;
}
A bit of research shows that you are using ImageFactory from https://imageprocessor.org/, which wraps System.Drawing. System.Drawing itself is often a wrapper for GDI/GDI+, which... incorporates process-wide locks, so your attempts at multithreading will be severely bottlenecked. Try a better image library.
(See Robert McKee's answer, though maybe this could be about disk I/O, but maybe not.)
So, I haven't used Parallel.ForEach before, but it seems like you should be running your UniqPicture method in parallel for all files in a given directory. I think your approach is good here, but ultimately your hard drive is probably killing the speed of your program (and vice versa).
Have you tried running UniqPicture in a loop sequentially? My concern here is that your hard drive may be thrashing. But in general, it's most likely that the input/output (IO) from your hard drive is taking a considerable amount of time, so the CPU is waiting a considerable amount of time before it can operate on the images in UniqPicture. If you could pre-load the images into memory, I would think the CPU utilization would be much higher, if not maxing out your CPU.
In no particular order, here are some thoughts
What happens if you run this sequentially? This will max out one core on the CPU at max, but it may prevent hard drive thrashing. If there are 100 threads being spun up, that's a lot of requests for the hard drive to deal with at once.
You should be able to add this option to make it run sequentially (or just make it a normal loop without Parallel.):
new ParallelOptions { MaxDegreeOfParallelism = 1 },
Maybe try 2, 3, or 4 threads and see if anything changes.
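For illustration, a small sketch of how that option is passed to the loop from the question's StartUniq() method (the degree-of-parallelism value is just a knob to experiment with):

var files = Directory.GetFiles("mypath");

// Try 1 (sequential), then 2, 3, 4, ... and compare run times and disk utilization.
var options = new ParallelOptions { MaxDegreeOfParallelism = 1 };

Parallel.ForEach(files, options, picture => UniqPicture(picture));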
Check your hard drive utilization in Task Manager. What's the latency on the hard drive where the images are stored? What percentage does Windows report it as busy? You want the hard drive to be busy the entire time (100% usage), but you also want it to be grabbing your images with the highest throughput possible so the CPU can do its job.
A spinning hard drive (HDD) has far lower IOPS (IO per second) than an SSD in general. An SSD will usually have 1000 to 100,000+ IOPS, but a HDD is around 200, I believe, and has much lower throughput usually. An SSD should help your program utilize the CPU much more.
The size of the image files could have an impact here, again relating to IO.
Or maybe see Robert McKee's answer about your threads getting bottlenecked. Maybe 13% CPU utilization is the best you can get. One of six cores (your CPU has 6 cores) being maxed is ~16.7%, so you actually aren't that far off from maxing one core already.
Ultimately, time how long it's taking. CPU utilization should scale inversely (higher CPU usage = lower run time) with the time this takes to run, but time it just to be sure, since that's the real benchmark.

Parallel GZip Decompression of Log Files - Tweaking MaxDegreeOfParallelism for the Highest Throughput

We have up to 30 GB of GZipped log files per day. Each file holds 100,000 lines and is between 6 and 8 MB when compressed. The simplified code below, in which the parsing logic has been stripped out, uses a Parallel.ForEach loop.
The number of lines processed peaks at a MaxDegreeOfParallelism of 8 on the two-NUMA-node, 32-logical-CPU box (Intel Xeon E7-2820 @ 2 GHz):
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

namespace ParallelLineCount
{
    public class ScriptMain
    {
        static void Main(String[] args)
        {
            int maxMaxDOP = (args.Length > 0) ? Convert.ToInt16(args[0]) : 2;
            string fileLocation = (args.Length > 1) ? args[1] : "C:\\Temp\\SomeFiles";
            string filePattern = (args.Length > 2) ? args[2] : "*2012-10-30.*.gz";
            string fileNamePrefix = (args.Length > 3) ? args[3] : "LineCounts";

            Console.WriteLine("Start: {0}", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
            Console.WriteLine("Processing file(s): {0}", filePattern);
            Console.WriteLine("Max MaxDOP to be used: {0}", maxMaxDOP.ToString());
            Console.WriteLine("");
            Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");

            for (int maxDOP = 1; maxDOP <= maxMaxDOP; maxDOP++)
            {
                // Construct ConcurrentStacks for resulting strings and counters
                ConcurrentStack<Int64> TotalLines = new ConcurrentStack<Int64>();
                ConcurrentStack<Int64> TotalSomeBookLines = new ConcurrentStack<Int64>();
                ConcurrentStack<Int64> TotalLength = new ConcurrentStack<Int64>();
                ConcurrentStack<int> TotalFiles = new ConcurrentStack<int>();

                DateTime FullStartTime = DateTime.Now;

                string[] files = System.IO.Directory.GetFiles(fileLocation, filePattern);
                var options = new ParallelOptions() { MaxDegreeOfParallelism = maxDOP };

                // Method signature: Parallel.ForEach(IEnumerable<TSource> source, ParallelOptions parallelOptions, Action<TSource> body)
                Parallel.ForEach(files, options, currentFile =>
                {
                    string filename = System.IO.Path.GetFileName(currentFile);
                    DateTime fileStartTime = DateTime.Now;

                    using (FileStream inFile = File.Open(fileLocation + "\\" + filename, FileMode.Open))
                    {
                        Int64 lines = 0, someBookLines = 0, length = 0;
                        String line = "";

                        using (var reader = new StreamReader(new GZipStream(inFile, CompressionMode.Decompress)))
                        {
                            while (!reader.EndOfStream)
                            {
                                line = reader.ReadLine();
                                lines++; // total lines
                                length += line.Length; // total line length

                                if (line.Contains("book")) someBookLines++; // some special lines that need to be parsed later
                            }

                            TotalLines.Push(lines); TotalSomeBookLines.Push(someBookLines); TotalLength.Push(length);
                            TotalFiles.Push(1); // silly way to count processed files :)
                        }
                    }
                });

                TimeSpan runningTime = DateTime.Now - FullStartTime;

                // Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");
                Console.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7}",
                    maxDOP.ToString(),
                    TotalFiles.Sum().ToString(),
                    Convert.ToInt32(runningTime.TotalMilliseconds).ToString(),
                    TotalLength.Sum().ToString(),
                    TotalLines.Sum(),
                    TotalSomeBookLines.Sum().ToString(),
                    Convert.ToInt64(TotalLines.Sum() / runningTime.TotalMilliseconds).ToString(),
                    Convert.ToInt64(TotalLength.Sum() / runningTime.TotalMilliseconds).ToString());
            }

            Console.WriteLine();
            Console.WriteLine("Finish: " + DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
        }
    }
}
Here's a summary of the results, with a clear peak at MaxDegreeOfParallelism = 8:
The CPU load (shown aggregated here, most of the load was on a single NUMA node, even when DOP was in 20 to 30 range):
The only way I've found to make CPU load cross 95% mark was to split the files across 4 different folders and execute the same command 4 times, each one targeting a subset of all files.
Can someone find a bottleneck?
It's likely that one problem is the small buffer size used by the default FileStream constructor. I suggest you use a larger input buffer. Such as:
using (FileStream infile = new FileStream(
name, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
The default buffer size is 4 kilobytes, which has the thread making many calls to the I/O subsystem to fill its buffer. A buffer of 64K means that you will make those calls much less frequently.
I've found that a buffer size of between 32K and 256K gives the best performance, with 64K being the "sweet spot" when I did some detailed testing a while back. A buffer size larger than 256K actually begins to reduce performance.
Also, although this is unlikely to have a major effect on performance, you probably should replace those ConcurrentStack instances with 64-bit integers and use Interlocked.Add or Interlocked.Increment to update them. It simplifies your code and removes the need to manage the collections.
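A small self-contained sketch of that suggestion, with placeholder per-file numbers standing in for the values produced by the reader loop:

using System;
using System.Threading;
using System.Threading.Tasks;

class InterlockedCountersSketch
{
    // Plain 64-bit totals replace the ConcurrentStack<Int64> collections.
    static long totalLines, totalLength;
    static int totalFiles;

    static void Main()
    {
        string[] files = { "a.gz", "b.gz", "c.gz" }; // placeholder file list

        Parallel.ForEach(files, file =>
        {
            long lines = 123, length = 45678; // per-file results would come from the reader loop

            Interlocked.Add(ref totalLines, lines);
            Interlocked.Add(ref totalLength, length);
            Interlocked.Increment(ref totalFiles);
        });

        Console.WriteLine($"{totalFiles} files, {totalLines} lines, {totalLength} characters of line text");
    }
}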
Update:
Re-reading your problem description, I was struck by this statement:
The only way I've found to make CPU load cross 95% mark was to split
the files across 4 different folders and execute the same command 4
times, each one targeting a subset of all files.
That, to me, points to a bottleneck in opening files. As though the OS is using a mutual exclusion lock on the directory. And even if all the data is in the cache and there's no physical I/O required, processes still have to wait on this lock. It's also possible that the file system is writing to the disk. Remember, it has to update the Last Access Time for a file whenever it's opened.
If I/O really is the bottleneck, then you might consider having a single thread that does nothing but load files and stuff them into a BlockingCollection or similar data structure so that the processing threads don't have to contend with each other for a lock on the directory. Your application becomes a producer/consumer application with one producer and N consumers.
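A rough sketch of that producer/consumer shape, assuming the decompress-and-count work is wrapped in a hypothetical CountLines() method (names and the bounded capacity are illustrative):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    static void Run(string[] files, int consumerCount)
    {
        // Bounded so the single reader cannot run too far ahead of the consumers.
        var loadedFiles = new BlockingCollection<(string Name, byte[] Data)>(boundedCapacity: 8);

        // One producer: does nothing but read the raw (still compressed) file bytes.
        var producer = Task.Run(() =>
        {
            foreach (var file in files)
                loadedFiles.Add((Path.GetFileName(file), File.ReadAllBytes(file)));
            loadedFiles.CompleteAdding();
        });

        // N consumers: decompress and parse entirely in memory.
        var consumers = Enumerable.Range(0, consumerCount)
            .Select(_ => Task.Run(() =>
            {
                foreach (var item in loadedFiles.GetConsumingEnumerable())
                    CountLines(item.Name, item.Data); // hypothetical per-file work
            }))
            .ToArray();

        Task.WaitAll(producer);
        Task.WaitAll(consumers);
    }

    static void CountLines(string name, byte[] compressedData)
    {
        // Decompress with a GZipStream over a MemoryStream and count lines here.
    }
}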
The reason for this is usually that threads synchronize too much.
Looking for synchronization in your code I can see heavy syncing on the collections. Your threads are pushing the lines individually. This means that each line incurs at best an interlocked operation and at worst a kernel-mode lock wait. The interlocked operations will contend heavily because all threads race to get their current line into the collection. They all try to update the same memory locations. This causes cache line pinging.
Change this to push lines in bigger chunks. Push line-arrays of 100 lines or more. The more the better.
In other words, collect results in a thread-local collection first and only rarely merge into the global results.
You might even want to get rid of the manual data pushing altogether. This is what PLINQ is made for: Streaming data concurrently. PLINQ abstracts away all the concurrent collection manipulations in a well-performing way.
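One way to express that thread-local aggregation is the localInit/localFinally overload of Parallel.ForEach; a minimal sketch that counts lines with only one synchronized merge per worker:

using System;
using System.IO;
using System.IO.Compression;
using System.Threading;
using System.Threading.Tasks;

class LocalAggregationSketch
{
    static void Run(string[] files)
    {
        long totalLines = 0;

        Parallel.ForEach(
            files,
            () => 0L,                        // localInit: per-worker line counter
            (file, state, localLines) =>     // body: aggregate locally, no shared state touched
            {
                using (var reader = new StreamReader(
                    new GZipStream(File.OpenRead(file), CompressionMode.Decompress)))
                {
                    while (reader.ReadLine() != null)
                        localLines++;
                }
                return localLines;
            },
            localLines => Interlocked.Add(ref totalLines, localLines)); // localFinally: merge once per worker

        Console.WriteLine($"Total lines: {totalLines}");
    }
}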
I don't think parallelizing the disk reads is helping you. In fact, this could be seriously impacting your performance by creating contention in reading from multiple areas of storage at the same time.
I would restructure the program to first do a single-threaded read of the raw file data into memory streams of byte[]. Then do a Parallel.ForEach() on each stream or buffer to decompress and count the lines.
You take an initial IO read hit up front, but you let the OS/hardware optimize the hopefully mostly sequential reads, then decompress and parse in memory.
Keep in mind that operations like decompression, Encoding.UTF8.ToString(), String.Split(), etc. will use large amounts of memory, so clean up references to / dispose of old buffers as you no longer need them.
I'd be surprised if you can't get the machine to generate some serious waste heat this way.
Hope this helps.
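A hedged sketch of that restructuring: a single-threaded read of the compressed bytes, followed by a parallel in-memory decompress-and-count phase:

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

class ReadThenDecompressSketch
{
    static void Run(string[] files)
    {
        // Single-threaded, mostly sequential IO: pull every compressed file into memory first.
        List<byte[]> buffers = files.Select(File.ReadAllBytes).ToList();

        long totalLines = 0;

        // CPU-bound phase: decompress and count lines entirely in memory, in parallel.
        Parallel.ForEach(buffers, buffer =>
        {
            long lines = 0;
            using (var reader = new StreamReader(
                new GZipStream(new MemoryStream(buffer), CompressionMode.Decompress)))
            {
                while (reader.ReadLine() != null)
                    lines++;
            }
            System.Threading.Interlocked.Add(ref totalLines, lines);
        });

        Console.WriteLine($"Total lines: {totalLines}");
    }
}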
The problem, I think, is that you are using blocking I/O, so your threads cannot fully take advantage of parallelism.
If I understand your algorithm right (sorry, I'm more of a C++ guy) this is what you are doing in each thread (pseudo-code):
while (there is data in the file)
    read data
    gunzip data
Instead, a better approach would be something like this:
N = 0
read data block N
while (there is data in the file)
    asyncRead data block N+1
    gunzip data block N
    N = N + 1
gunzip data block N
The asyncRead call does not block, so basically you have the decoding of block N happening concurrently with the reading of block N+1, so by the time you are done decoding block N you might have block N+1 ready (or close to be ready if I/O is slower than decoding).
Then it's just a matter of finding the block size that gives you the best throughput.
Good luck.
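For reference, a rough C# translation of that overlap idea using Stream.ReadAsync, where the read of block N+1 is started before block N is processed (the processing delegate stands in for the gunzip/parse step):

using System;
using System.IO;
using System.Threading.Tasks;

class OverlappedReadSketch
{
    // Reads the file in blocks, overlapping the read of block N+1 with the processing of block N.
    static async Task ProcessFileAsync(string path, int blockSize, Action<byte[], int> processBlock)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, blockSize, useAsync: true))
        {
            byte[] current = new byte[blockSize];
            byte[] next = new byte[blockSize];

            int currentCount = await stream.ReadAsync(current, 0, blockSize);
            while (currentCount > 0)
            {
                // Kick off the next read before touching the current block.
                Task<int> nextRead = stream.ReadAsync(next, 0, blockSize);

                processBlock(current, currentCount); // e.g. feed a decompressor/parser

                currentCount = await nextRead;
                (current, next) = (next, current);   // swap buffers for the next iteration
            }
        }
    }
}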

C# Threading - Reading and hashing multiple files concurrently, easiest method?

I've been trying to get what I believe to be the simplest possible form of threading to work in my application but I just can't do it.
What I want to do: I have a main form with a status strip and a progress bar on it. I have to read something between 3 and 99 files and add their hashes to a string[] which I want to add to a list of all files with their respective hashes. Afterwards I have to compare the items on that list to a database (which comes in text files).
Once all that is done, I have to update a textbox in the main form and the progressbar to 33%; mostly I just don't want the main form to freeze during processing.
The files I'm working with always sum up to 1.2GB (+/- a few MB), meaning I should be able to read them into byte[]s and process them from there (I have to calculate CRC32, MD5 and SHA1 of each of those files so that should be faster than reading all of them from a HDD 3 times).
Also I should note that some files may be 1MB while another one may be 1GB. I initially wanted to create 99 threads for 99 files but that seems not wise, I suppose it would be best to reuse threads of small files while bigger file threads are still running. But that sounds pretty complicated to me so I'm not sure if that's wise either.
So far I've tried workerThreads and backgroundWorkers but neither seem to work too well for me; at least the backgroundWorkers worked SOME of the time, but I can't even figure out why they won't the other times... either way the main form still froze.
Now I've read about the Task Parallel Library in .NET 4.0 but I thought I should better ask someone who knows what he's doing before wasting more time on this.
What I want to do looks something like this (without threading):
List<string[]> fileSpecifics = new List<string[]>();
int fileMaxNumber = 42; // something between 3 and 99, depending on file set
for (int i = 1; i <= fileMaxNumber; i++)
{
string fileName = "C:\\path\\to\\file" + i.ToString("D2") + ".ext"; // file01.ext - file99.ext
string fileSize = new FileInfo(fileName).Length.ToString();
byte[] file = File.ReadAllBytes(fileName);
// hash calculations (using SHA1CryptoServiceProvider() etc., no problems with that so I'll spare you that, return strings)
file = null; // I didn't yet check if this made any actual difference but I figured it couldn't hurt
fileSpecifics.Add(new string[] { fileName, fileSize, fileCRC, fileMD5, fileSHA1 });
}
// look for files in text database mentioned above, i.e. first check for "file bundles" with the same amount of files I have here; then compare file sizes, then hashes
// again, no problems with that so I'll spare you that; the database text files are pretty small so parsing them doesn't need to be done in an extra thread.
Would anybody be kind enough to point me in the right direction? I'm looking for the easiest way to read and hash those files quickly (I believe the hashing takes some time in which other files could already be read) and save the output to a string[], without the main form freezing, nothing more, nothing less.
I'm thankful for any input.
EDIT to clarify: by "backgroundWorkers working some of the time" I meant that (for the very same set of files), maybe the first and fourth execution of my code produces the correct output and the UI unfreezes within 5 seconds, for the second, third and fifth execution it freezes the form (and after 60 seconds I get an error message saying some thread didn't respond within that time frame) and I have to stop execution via VS.
Thanks for all your suggestions and pointers, as you all have correctly guessed I'm completely new to threading and will have to read up on the great links you guys posted.
Then I'll give those methods a try and flag the answer that helped me the most. Thanks again!
With .NET Framework 4.X
Use Directory.EnumerateFiles Method for efficient/lazy files enumeration
Use Parallel.For() or PLINQ to delegate the parallelism work to the framework, or use the TPL to dedicate a single Task to each pipeline stage
Use the Pipeline pattern to chain the following stages: calculate hash codes, compare with the pattern, update the UI
To avoid UI freezes use the appropriate techniques: for WPF use Dispatcher.BeginInvoke(), for WinForms use Invoke(); see this SO answer
Considering that all this stuff has a UI, it might be useful to add a cancellation feature to abandon a long-running operation if needed; take a look at the CreateLinkedTokenSource() method, which allows triggering a CancellationToken from the "external scope"
I could add an example, but it's worth doing it yourself so you learn all this stuff rather than simply copy/paste -> got it working -> forgot about it.
PS: Must read - Pipelines paper at MSDN
TPL specific pipeline implementation
Pipeline pattern implementation: three stages: calculate hash, match, update UI
Three tasks, one per stage
Two Blocking Queues
//
// 1) CalculateHashesImpl() should store all calculated hashes here
// 2) CompareMatchesImpl() should read input hashes from this queue
// Tuple.Item1 - hash, Tuple.Item2 - file path
var calculatedHashes = new BlockingCollection<Tuple<string, string>>();

// 1) CompareMatchesImpl() should store all pattern matching results here
// 2) SyncUiImpl() should read from this collection and update
//    the UI with available results
var comparedMatches = new BlockingCollection<string>();

var factory = new TaskFactory(TaskCreationOptions.LongRunning,
                              TaskContinuationOptions.None);

var calculateHashesWorker = factory.StartNew(() => CalculateHashesImpl(...));
var comparedMatchesWorker = factory.StartNew(() => CompareMatchesImpl(...));
var syncUiWorker = factory.StartNew(() => SyncUiImpl(...));

Task.WaitAll(calculateHashesWorker, comparedMatchesWorker, syncUiWorker);
CalculateHashesImpl():
private void CalculateHashesImpl(string directoryPath)
{
    foreach (var file in Directory.EnumerateFiles(directoryPath))
    {
        var hash = CalculateHashTODO(file);
        calculatedHashes.Add(new Tuple<string, string>(hash, file));
    }
}
CompareMatchesImpl():
private void CompareMatchesImpl()
{
    foreach (var hashEntry in calculatedHashes.GetConsumingEnumerable())
    {
        // TODO: obviously the return type is up to you
        string matchResult = GetMatchResultTODO(hashEntry.Item1, hashEntry.Item2);
        comparedMatches.Add(matchResult);
    }
}
SyncUiImpl():
private void SyncUiImpl()
{
    foreach (var matchResult in comparedMatches.GetConsumingEnumerable())
    {
        // TODO: track progress in the UI using UI-framework-specific features
        //       so as not to freeze it
    }
}
TODO: Consider passing a CancellationToken to all GetConsumingEnumerable() calls so you can easily stop the pipeline execution when needed.
First off, you should be using a higher level of abstraction to solve this problem. You have a bunch of tasks to complete, so use the "task" abstraction. You should be using the Task Parallel Library to do this sort of thing. Let the TPL deal with the question of how many worker threads to create -- the answer could be as low as one if the work is gated on I/O.
If you do want to do your own threading, some good advice:
Do not ever block the UI thread. That is what is freezing your application. Come up with a protocol by which worker threads can communicate with your UI thread, which then does nothing except respond to UI events. Remember that methods of user interface controls like task completion bars must never be called by any thread other than the UI thread.
Do not create 99 threads to read 99 files. That's like getting 99 pieces of mail and hiring 99 assistants to write responses: an extraordinarily expensive solution to a simple problem. If your work is CPU intensive then there is no point in "hiring" more threads than you have CPUs to service them. (That's like hiring 99 assistants in an office that only has four desks. The assistants spend most of their time waiting for a desk to sit at instead of reading your mail.) If your work is disk-intensive then most of those threads are going to be idle most of the time waiting for the disk, which is an even bigger waste of resources.
First, I hope you are using a built-in library for calculating hashes. It's possible to write your own, but it's far safer to use something that has been around for a while.
You may need only create as many threads as CPUs if your process is CPU intensive. If it is bound by I/O, you might be able to get away with more threads.
I do not recommend loading the entire file into memory. Your hashing library should support updating a chunk at a time. Read a chunk into memory, use it to update the hashes for each algorithm, read the next chunk, and repeat until the end of the file. The chunked approach will help lower your program's memory demands.
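A minimal sketch of that chunked approach, computing MD5 and SHA1 in a single pass over the file with TransformBlock/TransformFinalBlock (CRC32 is not built into the framework, so it is omitted here):

using System;
using System.IO;
using System.Security.Cryptography;

static class ChunkedHashing
{
    // Reads the file once, in chunks, and feeds every chunk to both hash algorithms.
    public static (string Md5, string Sha1) HashFile(string path, int chunkSize = 64 * 1024)
    {
        using (var md5 = MD5.Create())
        using (var sha1 = SHA1.Create())
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[chunkSize];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                md5.TransformBlock(buffer, 0, read, null, 0);
                sha1.TransformBlock(buffer, 0, read, null, 0);
            }

            md5.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
            sha1.TransformFinalBlock(Array.Empty<byte>(), 0, 0);

            return (BitConverter.ToString(md5.Hash).Replace("-", ""),
                    BitConverter.ToString(sha1.Hash).Replace("-", ""));
        }
    }
}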
As others have suggested, look into the Task Parallel Library, particularly Data Parallelism. It might be as easy as this:
Parallel.ForEach(fileSpecifics, item => CalculateHashes(item));
Check out TPL Dataflow. You can use a throttled ActionBlock which will manage the hard part for you.
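For example, a throttled ActionBlock might look roughly like this, assuming a HashFile helper along the lines of the sketch above (the degree of parallelism is an arbitrary example value):

using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

static class DataflowHashing
{
    public static async Task HashAllAsync(IEnumerable<string> files)
    {
        // Throttle to a handful of concurrent hashing jobs regardless of how many files are posted.
        var hashBlock = new ActionBlock<string>(
            file =>
            {
                var hashes = ChunkedHashing.HashFile(file); // helper from the sketch above
                // store / report the result here
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        foreach (var file in files)
            hashBlock.Post(file);

        hashBlock.Complete();
        await hashBlock.Completion;
    }
}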
If my understanding is correct that you are looking to perform some tasks in the background and not block your UI, then BackgroundWorker would be an appropriate choice. You mentioned that you got it working some of the time, so my recommendation would be to take what you had in a semi-working state and improve upon it by tracking down the failures. If my hunch is correct, your worker was throwing an exception, which it does not appear you are handling in your code. Unhandled exceptions that bubble out of their containing threads make bad things happen.
This code hashes one file (stream) using two tasks - one for reading and one for hashing; for a more robust approach you should read more chunks ahead.
Because the processor's bandwidth is much higher than the disk's, unless you use some high-speed flash drive you gain nothing from hashing more files concurrently.
public void TransformStream(Stream a_stream, long a_length = -1)
{
    Debug.Assert((a_length == -1 || a_length > 0));

    if (a_stream.CanSeek)
    {
        if (a_length > -1)
        {
            if (a_stream.Position + a_length > a_stream.Length)
                throw new IndexOutOfRangeException();
        }

        if (a_stream.Position >= a_stream.Length)
            return;
    }

    System.Collections.Concurrent.ConcurrentQueue<byte[]> queue =
        new System.Collections.Concurrent.ConcurrentQueue<byte[]>();
    System.Threading.AutoResetEvent data_ready = new System.Threading.AutoResetEvent(false);
    System.Threading.AutoResetEvent prepare_data = new System.Threading.AutoResetEvent(false);

    Task reader = Task.Factory.StartNew(() =>
    {
        long total = 0;

        for (; ; )
        {
            byte[] data = new byte[BUFFER_SIZE];
            int readed = a_stream.Read(data, 0, data.Length);

            if ((a_length == -1) && (readed != BUFFER_SIZE))
                data = data.SubArray(0, readed);
            else if ((a_length != -1) && (total + readed >= a_length))
                data = data.SubArray(0, (int)(a_length - total));

            total += data.Length;

            queue.Enqueue(data);
            data_ready.Set();

            if (a_length == -1)
            {
                if (readed != BUFFER_SIZE)
                    break;
            }
            else if (a_length == total)
                break;
            else if (readed != BUFFER_SIZE)
                throw new EndOfStreamException();

            prepare_data.WaitOne();
        }
    });

    Task hasher = Task.Factory.StartNew((obj) =>
    {
        IHash h = (IHash)obj;
        long total = 0;

        for (; ; )
        {
            data_ready.WaitOne();

            byte[] data;
            queue.TryDequeue(out data);

            prepare_data.Set();

            total += data.Length;

            if ((a_length == -1) || (total < a_length))
            {
                h.TransformBytes(data, 0, data.Length);
            }
            else
            {
                int readed = data.Length;
                readed = readed - (int)(total - a_length);
                h.TransformBytes(data, 0, readed);
            }

            if (a_length == -1)
            {
                if (data.Length != BUFFER_SIZE)
                    break;
            }
            else if (a_length == total)
                break;
            else if (data.Length != BUFFER_SIZE)
                throw new EndOfStreamException();
        }
    }, this);

    reader.Wait();
    hasher.Wait();
}
Rest of code here: http://hashlib.codeplex.com/SourceControl/changeset/view/71730#514336

C# multiple text file processing

Let's say that you want to write an application that processes multiple text files, supplied as arguments at the command line (e.g., MyProcessor file1 file2 ...). This is a very common task for which Perl is often used, but what if one wanted to take advantage of .NET directly and use C#?
What is the simplest C# 4.0 application boilerplate code that allows you to do this? It should basically include line-by-line processing of each line from each file, doing something with that line by either calling a function to process it, or maybe there's a better way to do this sort of "group" line processing (e.g., LINQ or some other method).
You could process files in parallel by reading each line and passing it to a processing function:
class Program
{
    static void Main(string[] args)
    {
        Parallel.ForEach(args, file =>
        {
            using (var stream = File.OpenRead(file))
            using (var reader = new StreamReader(stream))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    ProcessLine(line);
                }
            }
        });
    }

    static void ProcessLine(string line)
    {
        // TODO: process the line
    }
}
Now simply call: SomeApp.exe file1 file2 file3
Pros of this approach:
Files are processed in parallel => taking advantage of multiple CPU cores
Files are read line by line and only the current line is kept into memory which reduces memory consumption and allows you to work with big files
Simple:
foreach (var f in args)
{
    var filecontent = File.ReadAllText(f);
    // Logic goes here
}
After much experimenting, changing this line in Darin Dimitrov's answer:
using (var stream = File.OpenRead(file))
to:
using (var stream = new FileStream(file, System.IO.FileMode.Open,
                                   System.IO.FileAccess.Read,
                                   System.IO.FileShare.ReadWrite,
                                   65536))
which changes the read buffer size from the 4 KB default to 64 KB, can shave as much as 10% off of the file read time when reading "line at a time" via a StreamReader, especially if the text file is large. Larger buffer sizes do not seem to improve performance further.
This improvement is present, even when reading from a relatively fast SSD. The savings are even more substantial if an ordinary HD is used. Interestingly, you get this significant performance improvement even if the file is already cached by the (Windows 7 / 2008R2) OS, which is somewhat counterintuitive.
