I'm looking for the best approach to reading from a slow data source such as Azure Table Storage, converting the data to JSON or CSV, and writing it to a local file whose name depends on the partition key.
One approach being considered is running the write-to-file task from a timer's elapsed event at a fixed interval.
For things that do not parallelize well (like I/O), the best thing to do is use the producer-consumer model.
The way it works is that you have one thread handling the non-parallelizable task; all that task does is read into a buffer. Then you have a set of parallel tasks that all read from the buffer and process the data, and they put the results into another buffer when they are done. If you then need to write out the results in a non-parallelizable way, you have another single task that writes them out.
public Stream ProcessData(string filePath)
{
    using (var sourceCollection = new BlockingCollection<string>())
    using (var destinationCollection = new BlockingCollection<SomeClass>())
    {
        // Start a background task that reads the file line by line.
        Task.Factory.StartNew(() => ReadInFile(filePath, sourceCollection), TaskCreationOptions.LongRunning);
        // Start a background task that transforms the lines as they come in.
        Task.Factory.StartNew(() => TransformToClass(sourceCollection, destinationCollection), TaskCreationOptions.LongRunning);
        // Serialize the newly created objects on the thread that called this function.
        return TransformToStream(destinationCollection);
    }
}

private static void ReadInFile(string filePath, BlockingCollection<string> collection)
{
    foreach (var line in File.ReadLines(filePath))
    {
        collection.Add(line);
    }
    // This lets the consumer know that we will not be adding any more items to the collection.
    collection.CompleteAdding();
}

private static void TransformToClass(BlockingCollection<string> source, BlockingCollection<SomeClass> dest)
{
    // GetConsumingEnumerable() takes items out of the collection and blocks the thread
    // while no items are available and CompleteAdding() has not been called yet.
    Parallel.ForEach(source.GetConsumingEnumerable(),
        line => dest.Add(SomeClass.ExpensiveTransform(line)));
    dest.CompleteAdding();
}

private static Stream TransformToStream(BlockingCollection<SomeClass> source)
{
    var stream = new MemoryStream();
    foreach (var record in source.GetConsumingEnumerable())
    {
        record.Serialize(stream);
    }
    // Rewind so the caller reads from the beginning of the stream.
    stream.Position = 0;
    return stream;
}
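To tie this back to the question's requirement of one file per partition key, the final single-threaded stage could look something like the sketch below. This is only an illustration: the Row record, its PartitionKey and Json properties, and the PartitionedWriter class are hypothetical names that do not come from the question, and it assumes every partition key is usable as a file name.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;

// Hypothetical record type: PartitionKey and Json are illustrative names.
public record Row(string PartitionKey, string Json);

public static class PartitionedWriter
{
    // Drains the queue on the calling thread and appends each row to
    // "<outputDir>/<PartitionKey>.json". Returns the number of rows written.
    public static int WriteAll(BlockingCollection<Row> source, string outputDir)
    {
        Directory.CreateDirectory(outputDir);
        var writers = new Dictionary<string, StreamWriter>();
        var written = 0;
        try
        {
            foreach (var row in source.GetConsumingEnumerable())
            {
                if (!writers.TryGetValue(row.PartitionKey, out var writer))
                {
                    // One open writer per partition key (assumes the key is a valid file name).
                    var path = Path.Combine(outputDir, row.PartitionKey + ".json");
                    writer = new StreamWriter(path, append: true);
                    writers[row.PartitionKey] = writer;
                }
                writer.WriteLine(row.Json);
                written++;
            }
        }
        finally
        {
            // Flush and close every file, even if writing failed part-way.
            foreach (var writer in writers.Values)
            {
                writer.Dispose();
            }
        }
        return written;
    }
}
```

Because only one thread ever touches the writers, no locking is needed around the file access; the BlockingCollection is the only shared state.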
I highly recommend you read the free book Patterns for Parallel Programming; it goes into some detail about this. There is an entire section explaining the producer-consumer model in detail.
UPDATE: For a small performance boost, use GetConsumingPartitioner() from ParallelExtensionsExtras instead of GetConsumingEnumerable() in the Parallel.ForEach loop. ForEach makes some assumptions about the IEnumerable being passed in that cause it to take out extra locks it does not need; if you pass a partitioner instead of an enumerable, it does not need to take those extra locks.
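If you would rather not pull in ParallelExtensionsExtras, a related built-in option is Partitioner.Create with EnumerablePartitionerOptions.NoBuffering. This is not the same implementation as GetConsumingPartitioner(); it is a sketch of the framework's own way to stop Parallel.ForEach from buffering items in chunks. The NoBufferingConsumer class and TotalLength method are made-up names for the example.

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class NoBufferingConsumer
{
    // Sums the lengths of all lines in the collection using Parallel.ForEach.
    public static long TotalLength(BlockingCollection<string> source)
    {
        // NoBuffering hands items to workers one at a time instead of in chunks.
        var partitioner = Partitioner.Create(
            source.GetConsumingEnumerable(),
            EnumerablePartitionerOptions.NoBuffering);

        long total = 0;
        Parallel.ForEach(partitioner, line => Interlocked.Add(ref total, line.Length));
        return total;
    }
}
```

The call returns only after CompleteAdding() has been called on the collection and every queued item has been processed.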
Related
Put a breakpoint before the thread start and you will notice the console app using about 5 to 8 MB of memory, but once the threads start it spikes to 17 to 20 MB, and that memory stays in use until the console closes. How can I free memory after a thread has finished its task? Is there a better solution?
Now the question is: why do I need this, since the garbage collector will free memory automatically when needed? I need it because I am doing web scraping. I have a global class that stores all the scraped HTML text, and I have to scrape about 10k pages and store that HTML in the global class. What happens is that after scraping about 500 pages into the global class, the app eats almost 100% of my PC's RAM, which is 20 GB. So I need to free up RAM. I can't close the console app to free it because I have some calculations to run after collecting all the HTML.
class DemoData
{
    public int Id { get; set; }
    public string Text { get; set; }
    public static List<DemoData> data = new List<DemoData>();
}

class Program
{
    public static void Main()
    {
        for (var i = 0; i < 5000; i++)
        {
            DemoData.data.Add(new DemoData
            {
                Id = i,
                Text = "something....",
            });
        }
        foreach (var item in DemoData.data)
        {
            var t = new Thread(new ThreadStart(DoSomething)); // put a breakpoint here and see
            t.Name = item.Id.ToString();
            t.Start();
        }
        Console.WriteLine("wait");
        Console.ReadLine();
    }

    public static void DoSomething()
    {
        Thread thr = Thread.CurrentThread;
        Console.WriteLine(thr.Name);
    }
}
You just need to wait for the garbage collector to run. Your current app isn't complex enough to trigger a collection before it closes, so you aren't seeing one occur.
The framework fires off the GC when it decides one is needed. The GC then works out which objects are no longer reachable (from the call stack, static fields, and other roots) and deletes them, freeing up memory. This happens completely automatically, without you needing to do anything.
If you want to help the GC, check your variable scopes so that it can see that data is no longer needed because it is completely out of scope: avoid global variables, avoid creating links between data objects that aren't needed, make correct choices between value and reference types, etc.
However, if you do end up in a position where you need to trigger the GC manually, you can call GC.Collect().
see
https://learn.microsoft.com/en-us/dotnet/api/system.gc.collect?view=net-5.0
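To see the effect of dropping references, here is a small sketch that holds a pile of strings in a list, clears it, and compares what GC.GetTotalMemory reports before and after. The GcDemo class and the sizes are arbitrary choices for illustration, and forcing a collection like this is meant for diagnostics rather than as a fix for production code.

```csharp
using System;
using System.Collections.Generic;

public static class GcDemo
{
    // Returns (bytes while the data is held, bytes after it is released),
    // as reported by GC.GetTotalMemory after a forced full collection.
    public static (long held, long released) Measure()
    {
        var pages = new List<string>();
        for (var i = 0; i < 2000; i++)
        {
            pages.Add(new string('x', 10_000)); // roughly 20 KB each (UTF-16)
        }
        long held = GC.GetTotalMemory(forceFullCollection: true);

        // Drop the references so the strings become unreachable.
        pages.Clear();
        pages.TrimExcess();
        long released = GC.GetTotalMemory(forceFullCollection: true);

        return (held, released);
    }
}
```

The point is that the memory only becomes collectable once nothing references it any more, which is exactly what a static list of scraped pages prevents.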
You might need to use a memory profiler to check for potential memory leaks.
A possible reason for your issues is the large number of threads. For CPU-bound work there is little reason to use more threads than there are CPU cores, and each thread needs some memory for its stack and other housekeeping.
One fairly simple way would be to put the addresses that need visiting in a collection and use a Parallel.ForEach loop. This will try to adjust the number of threads used to maximize throughput. More complex variants could use one of the concurrent collections and multiple consumers. There are also async variants of many I/O calls that avoid the memory overhead of blocking a thread while waiting for I/O. I would recommend reading some examples of the multiple producers/consumers pattern for more details.
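As a rough sketch of the Parallel.ForEach suggestion, the example below caps the number of concurrent workers with ParallelOptions.MaxDegreeOfParallelism and reports the peak concurrency it actually observed. BoundedWork, Run, and the visit callback are hypothetical names; the callback stands in for whatever download or parsing step you have.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedWork
{
    // Processes every address with at most maxParallel concurrent workers and
    // returns the highest number of workers that were active at the same time.
    public static int Run(IEnumerable<string> addresses, int maxParallel, Action<string> visit)
    {
        int active = 0, peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(addresses, options, address =>
        {
            int now = Interlocked.Increment(ref active);

            // Record the peak with a compare-and-swap loop.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen)
            {
            }

            try
            {
                visit(address);
            }
            finally
            {
                Interlocked.Decrement(ref active);
            }
        });

        return peak;
    }
}
```

Compared with one Thread per item, the number of stacks alive at any moment is bounded by maxParallel no matter how many addresses are queued.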
If your problem is feeding the data into the threads for processing, then I would suggest you look at https://devblogs.microsoft.com/dotnet/an-introduction-to-system-threading-channels/, which lets you create a thread-safe processing buffer.
Here is a working example. Note that it uses the .NET 5 syntax (top-level statements).
using System;
using System.Threading.Channels;
using System.Net.Http;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Linq;
var sites = Enumerable.Range(0, 5000).Select(i => "http://www.example.com");
//create thread safe buffer no limit on size
var sitebuffer = Channel.CreateUnbounded<string>();
//create thread safe buffer limited to 10 elements
var htmlbuffer = Channel.CreateBounded<string>(10);
async Task Feed()
{
//while the buffer hasn't closed, wait for new data to be available
while(await sitebuffer.Reader.WaitToReadAsync())
{
//read the next available url from the buffer
var uri = await sitebuffer.Reader.ReadAsync();
var http = new HttpClient();
var html= await http.GetAsync(uri);
Console.WriteLine("reading site");
//load the return text to the htmlbuffer, if buffer is full wait for space
await htmlbuffer.Writer.WriteAsync(await html.Content.ReadAsStringAsync());
Console.WriteLine("reading site complete");
}
}
async Task Process()
{
//while the buffer hasn't closed, wait for new data to be available
while (await htmlbuffer.Reader.WaitToReadAsync())
{
//read html from buffer send to doSomething then read next element
var html = await htmlbuffer.Reader.ReadAsync();
await doSomethingWithHTML(html);
}
}
async Task doSomethingWithHTML(string html)
{
await Task.Delay(2);
Console.WriteLine("done something");
}
//start 4 feeder tasks
var feeders = new[]
{
Feed(),
Feed(),
Feed(),
Feed(),
};
//start 2 worker tasks
var workers = new[]
{
Process(),
Process(),
};
//start of feeding in sites
foreach (var item in sites)
{
await sitebuffer.Writer.WriteAsync(item);
}
//mark that all sites have been fed into the systems
sitebuffer.Writer.Complete();
//wait for all feeders to finish
await Task.WhenAll(feeders);
//mark that no more sites will be read
htmlbuffer.Writer.Complete();
//wait for all workers to finish
await Task.WhenAll(workers);
Console.WriteLine("all tasks complete");
Notice the async Tasks and the awaits: this is a newer abstraction over threads that hides a lot of the complexity of managing them.
What is the best queue data structure to use in C# when the queue needs to be accessible for Enqueue() on multiple threads but only needs to Dequeue() on a single main thread? My thread structure looks like this:
Main Thread - Consumer
Sub Thread1 - Producer
Sub Thread2 - Producer
Sub Thread3 - Producer
I have a single Queue<T> queue that holds all items produced by the sub-threads and the Main Thread calls queue.Dequeue() until it is empty. I have the following function that is called on my Main Thread for this purpose.
public void ConsumeItems()
{
while (queue.Count > 0)
{
var item = queue.Dequeue();
...
}
}
The Main Thread calls this function once through each thread loop. I want to make sure I am accessing queue in a thread-safe manner, but I also want to avoid locking queue if possible for performance reasons.
The one you want is BlockingCollection<T>, which by default is backed by a ConcurrentQueue<T>. To get items out of the queue, use .GetConsumingEnumerable() inside a foreach:
public BlockingCollection<Item> queue = new BlockingCollection<Item>();

public void LoadItems()
{
    foreach (var item in SomeDataSource())
    {
        queue.Add(item);
    }
    queue.CompleteAdding();
}

public void ConsumeItems()
{
    foreach (var item in queue.GetConsumingEnumerable())
    {
        ...
    }
}
When the queue is empty, the foreach blocks the thread and unblocks as soon as an item becomes available. Once .CompleteAdding() has been called, the foreach finishes processing any items still in the queue and exits the foreach block when the queue is empty.
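With several producer threads, as in the question, the one thing to watch is that CompleteAdding() is only called after the last producer has finished. A minimal sketch of that arrangement follows; the MultiProducer class and the integer payload are invented for the example.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class MultiProducer
{
    // Starts producerCount producers that each add itemsPerProducer integers,
    // then consumes everything on the calling thread and returns what it received.
    public static List<int> Run(int producerCount, int itemsPerProducer)
    {
        using var queue = new BlockingCollection<int>();

        var producers = Enumerable.Range(0, producerCount)
            .Select(p => Task.Run(() =>
            {
                for (var i = 0; i < itemsPerProducer; i++)
                {
                    queue.Add(p * itemsPerProducer + i);
                }
            }))
            .ToArray();

        // Signal completion only after every producer has finished adding.
        var completer = Task.WhenAll(producers).ContinueWith(_ => queue.CompleteAdding());

        // The calling thread is the single consumer.
        var received = new List<int>();
        foreach (var item in queue.GetConsumingEnumerable())
        {
            received.Add(item);
        }

        completer.Wait();
        return received;
    }
}
```

No explicit lock appears anywhere in your own code; the collection handles the synchronization between the producers and the consumer.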
However, before you do this, I would recommend you look in to TPL Dataflow, with it you don't need to manage the queues or the threads anymore. It lets you build chains of logic and each block in the chain can have a separate level of concurrency.
public async Task ProcessDataAsync(IEnumerable<SomeInput> input)
{
    using (var outfile = File.CreateText("outfile.txt"))
    {
        // Create a convert block that uses the number of processors on the machine to decide how many items to process in parallel.
        var convertBlock = new TransformBlock<SomeInput, string>(x => CpuIntensiveConversion(x), new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });
        // Create a single-threaded block that writes out to the text writer.
        var writeBlock = new ActionBlock<string>(x => outfile.WriteLine(x));
        // Link the convert block to the write block.
        convertBlock.LinkTo(writeBlock, new DataflowLinkOptions { PropagateCompletion = true });
        // Add items to the convert block's queue.
        foreach (var item in input)
        {
            await convertBlock.SendAsync(item);
        }
        // Tell the convert block we are done adding. Completion propagates to the write block once all items are processed.
        convertBlock.Complete();
        // Wait for the write block to finish writing out to the file.
        await writeBlock.Completion;
    }
}
My process gets the data through HTTP requests, and it gets the data in chunks (100 records at a time). In my case I have 100,000 records,
and then I need to process that data and load it into a DB.
My current process:
GrabAllRecords()
{
GRAB all 100,000 records(i.e 1000 requests).. its big amount of time.
Load into ArrayData
}
then..
Process Data(ArrayData)
{
}
But I need some thing like this...
START:
step1:
Grab 100 Records load into arraylist..
repeat step1 until it reach 100,000
step2:
process arrayList
This screams for the producer - consumer design pattern: one producer produces something in its own pace, while one or more consumers wait until something is produced, grab the produced information and process it, possibly leading to new produced output that other consumers might process.
Microsoft has good support for this via the TPL Dataflow NuGet package (System.Threading.Tasks.Dataflow).
Implement a Producer-Consumer Dataflow Pattern
Also helpful to start: Walkthrough: Creating a Dataflow Pipeline
The producer produces output in processable units, in your case: chunks. The output will be sent to an object of class BufferBlock< T > , where T is your chunk. Code will be similar to:
public class ChunkProducer
{
    private readonly BufferBlock<Chunk> outputBuffer = new BufferBlock<Chunk>();

    // Whenever the ChunkProducer produces a chunk, it is put in this buffer.
    // Consumers need access to this buffer as their source of data:
    public ISourceBlock<Chunk> OutputBuffer
    {
        get { return this.outputBuffer as ISourceBlock<Chunk>; }
    }

    public async Task ProduceAsync()
    {
        while (someThingsToProcess)
        {
            Chunk chunk = CreateChunk(...);
            await this.outputBuffer.SendAsync(chunk);
        }
        // If here: nothing to process anymore.
        // Notify consumers that all output has been produced.
        this.outputBuffer.Complete();
    }
}
The efficiency of this can be improved by creating the next chunk while the previous one is still being sent, and awaiting the previous send just before sending the next chunk. This is a bit out of scope here. More info about this is available on Stack Overflow.
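A rough sketch of that overlap is shown here. To keep it free of extra packages it uses a bounded Channel from System.Threading.Channels instead of a BufferBlock, so it illustrates the idea rather than the Dataflow code itself. OverlappedProducer, CreateChunkAsync, and the int[] chunk type are placeholders, and the delay stands in for the slow HTTP call.

```csharp
using System.Threading.Channels;
using System.Threading.Tasks;

public static class OverlappedProducer
{
    // Hypothetical stand-in for the slow call that fetches one chunk.
    private static async Task<int[]> CreateChunkAsync(int index)
    {
        await Task.Delay(5);
        return new[] { index };
    }

    public static async Task ProduceAsync(ChannelWriter<int[]> output, int chunkCount)
    {
        Task pendingSend = Task.CompletedTask;
        for (var i = 0; i < chunkCount; i++)
        {
            // Build the next chunk while the previous send may still be in flight...
            int[] chunk = await CreateChunkAsync(i);
            // ...then make sure the previous send finished before starting the next one.
            await pendingSend;
            pendingSend = output.WriteAsync(chunk).AsTask();
        }
        await pendingSend;
        // Notify the consumer that all output has been produced.
        output.Complete();
    }

    // Runs the producer against a single consumer and returns the number of records received.
    public static async Task<int> RunAsync(int chunkCount)
    {
        var channel = Channel.CreateBounded<int[]>(1);
        Task producer = ProduceAsync(channel.Writer, chunkCount);

        var received = 0;
        await foreach (int[] chunk in channel.Reader.ReadAllAsync())
        {
            received += chunk.Length;
        }

        await producer;
        return received;
    }
}
```

Chunks still arrive in order, because a new send never starts before the previous one has completed.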
You'll also need a ChunkConsumer. The ChunkConsumer will wait for chunks on the buffer block and process them:
public class ChunkConsumer
{
    private readonly ISourceBlock<Chunk> chunkSource;

    // The ChunkConsumer will wait for input at this source.
    public ChunkConsumer(ISourceBlock<Chunk> chunkSource)
    {
        this.chunkSource = chunkSource;
    }

    public async Task ConsumeAsync()
    {
        // Wait until there is some data in the buffer.
        while (await this.chunkSource.OutputAvailableAsync())
        {
            // Get the chunk and process it:
            Chunk chunk = this.chunkSource.Receive();
            ProcessChunk(chunk);
        }
        // If here: chunkSource has been completed. No more data to expect.
    }
}
Put it all together:
private async Task ProcessAsync()
{
ChunkProducer producer = new ChunkProducer();
ChunkConsumer consumer = new ChunkConsumer(producer.OutputBuffer);
// start a thread for the consumer to consume:
Task consumeTask = Task.Run( () => consumer.ConsumeAsync());
// let this thread start producing, and await until it is completed
await producer.ProduceAsync();
// if here, I know the producer finished producing
// wait until the consumer finished consuming:
await consumeTask;
// finished, all produced data is consumed.
}
Possible enhancements:
If producing is faster than consuming, consider using multiple consumers listening to the same ISourceBlock. Check TPL to see which of the BufferBlock types can handle multiple listeners
If producing is slower than consuming, consider using multiple producers producing to the same ITargetBlock. Check which type of buffer block can handle this.
Consider enabling cancellation using CancellationToken
If your chunk is not always the same number of records, consider using a batch block: The consumer gets notified if the batch has enough records to process.
You can use the DataFlow library to do something like this:
ActionBlock<Record[]> action_block = new ActionBlock<Record[]>(
x => ConsumeRecords(x),
new ExecutionDataflowBlockOptions
{
//Use one thread to process data.
//You can increase it if you want
//That would make sense if you produce the records faster than you consume them
MaxDegreeOfParallelism = 1
});
for (int i = 0; i < 1000; i++)
{
action_block.Post(ProduceNext100Records());
}
I am assuming that you have a method called ProduceNext100Records that produces records (e.g. via web service call) and another method called ConsumeRecords that consumes the records.
The easy answer, I think, is to use Microsoft Reactive Extensions (NuGet "Rx-Main", since renamed "System.Reactive").
Then you can do something like this:
var query =
from records in Get100Records().ToObservable()
from record in records.ToObservable()
from result in Observable.Start(() => ProcessRecord(record))
select new { record, result };
IDisposable subscription =
query
.Subscribe(
rr =>
{
/* Process each `rr.record`/`rr.result`
as they are produced */
},
() => { /* Run when all completed */ });
This will process in parallel and you'll start getting results as soon as the first ProcessRecord call is completed.
If you need to stop the processing early you just call subscription.Dispose().
Right now, I've got a C# program that performs the following steps on a recurring basis:
Grab current list of tasks from the database
Using Parallel.ForEach(), do work for each task
However, some of these tasks are very long-running. This delays the processing of other pending tasks because we only look for new ones at the start of the program.
Now, I know that modifying the collection being iterated over isn't possible (right?), but is there some equivalent functionality in the C# Parallel framework that would allow me to add work to the list while also processing items in the list?
Generally speaking, you're right that modifying a collection while iterating it is not allowed. But there are other approaches you could be using:
Use ActionBlock<T> from TPL Dataflow. The code could look something like:
var actionBlock = new ActionBlock<MyTask>(
task => DoWorkForTask(task),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });
while (true)
{
var tasks = GrabCurrentListOfTasks();
foreach (var task in tasks)
{
actionBlock.Post(task);
await Task.Delay(someShortDelay);
// or use Thread.Sleep() if you don't want to use async
}
}
Use BlockingCollection<T>, which can be modified while consuming items from it, along with GetConsumingPartitioner() from ParallelExtensionsExtras to make it work with Parallel.ForEach():
var collection = new BlockingCollection<MyTask>();
Task.Run(async () =>
{
while (true)
{
var tasks = GrabCurrentListOfTasks();
foreach (var task in tasks)
{
collection.Add(task);
await Task.Delay(someShortDelay);
}
}
});
Parallel.ForEach(collection.GetConsumingPartitioner(), task => DoWorkForTask(task));
Here is an example of an approach you could try. I think you want to get away from Parallel.ForEaching and do something with asynchronous programming instead because you need to retrieve results as they finish, rather than in discrete chunks that could conceivably contain both long running tasks and tasks that finish very quickly.
This approach uses a simple sequential loop to retrieve results from a list of asynchronous tasks. In this case, you should be safe to use a simple non-thread safe mutable list because all of the mutation of the list happens sequentially in the same thread.
Note that this approach uses Task.WhenAny in a loop which isn't very efficient for large task lists and you should consider an alternative approach in that case. (See this blog: http://blogs.msdn.com/b/pfxteam/archive/2012/08/02/processing-tasks-as-they-complete.aspx)
This example is based on: https://msdn.microsoft.com/en-GB/library/jj155756.aspx
private async Task<ProcessResult> processTask(ProcessTask task)
{
// do something intensive with data
}
private IEnumerable<ProcessTask> GetOutstandingTasks()
{
// retreive some tasks from db
}
private async Task ProcessAllData()
{
    List<Task<ProcessResult>> taskQueue =
        GetOutstandingTasks()
            .Select(tsk => processTask(tsk))
            .ToList(); // grab initial task queue
    while (taskQueue.Any()) // iterate while tasks need completing
    {
        Task<ProcessResult> firstFinishedTask = await Task.WhenAny(taskQueue); // get first to finish
        taskQueue.Remove(firstFinishedTask); // remove the one that finished
        ProcessResult result = await firstFinishedTask; // get the result
        // do something with task result
        taskQueue.AddRange(GetOutstandingTasks().Select(tsk => processTask(tsk))); // add more tasks that need performing
    }
}
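For larger task lists, one alternative to calling Task.WhenAny in a loop is to have each task announce its own completion through a channel, so each task is touched once rather than on every pass. This is a sketch of that general idea and not the code from the linked blog post; CompletionOrder, InCompletionOrder, and CollectAsync are made-up names.

```csharp
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class CompletionOrder
{
    // Yields each task's result as soon as that task finishes, in completion order.
    public static async IAsyncEnumerable<T> InCompletionOrder<T>(IReadOnlyCollection<Task<T>> tasks)
    {
        var finished = Channel.CreateUnbounded<Task<T>>();
        foreach (var task in tasks)
        {
            // Each continuation runs once, so the total work is linear in the task count.
            _ = task.ContinueWith(t => finished.Writer.TryWrite(t), TaskScheduler.Default);
        }

        for (var i = 0; i < tasks.Count; i++)
        {
            Task<T> done = await finished.Reader.ReadAsync();
            yield return await done; // rethrows here if the task faulted
        }
    }

    // Convenience helper that gathers the results into a list.
    public static async Task<List<T>> CollectAsync<T>(IReadOnlyCollection<Task<T>> tasks)
    {
        var results = new List<T>();
        await foreach (var result in InCompletionOrder(tasks))
        {
            results.Add(result);
        }
        return results;
    }
}
```

This version works on a fixed set of tasks; adding newly discovered tasks while iterating, as the loop above does, would need the task count to be tracked separately.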
I have a situation in which I have a producer/consumer scenario. The producer never stops, which means that even if there is a time where there are no items in the BC, further items can be added later.
Moving from .NET Framework 3.5 to 4.0, I decided to use a BlockingCollection as a concurrent queue between the consumer and the producer. I even added some parallel extensions so I could use the BC with a Parallel.ForEach.
The problem is that, in the consumer thread, I need to have a kind of hybrid model:
I'm always checking the BC to process any item that has arrived, with a
Parallel.ForEach(bc.GetConsumingEnumerable(), item => etc
Inside this foreach, I execute all the tasks that don't depend on each other.
Here comes the problem. After parallelizing the previous tasks, I need to manage their results in the same FIFO order in which they were in the BC. The processing of these results should be done in a single synchronous thread.
A little example in pseudo code follows:
producer:
//This event is triggered each time a page is scanned. Any batch of new pages can be added at any time at the scanner
private void Current_OnPageScanned(object sender, ScannedPage scannedPage)
{
//The object to add has a property with the sequence number
_concurrentCollection.TryAdd(scannedPage);
}
consumer:
private void Init()
{
_cancelTasks = false;
_checkTask = Task.Factory.StartNew(() =>
{
while (!_cancelTasks)
{
//BlockingCollections with Parallel ForEach
var bc = _concurrentCollection;
Parallel.ForEach(bc.GetConsumingEnumerable(), item =>
{
ScannedPage currentPage = item;
// process a batch of images from the bc and check if an image has a valid barcode. T
});
//Here should go the code that takes the results from each tasks, process them in the same FIFO order in which they entered the BC and save each image to a file, all of this in this same thread.
}
});
}
Obviously, this can't work as it is, because .GetConsumingEnumerable() blocks until there is another item in the BC. I assume I could do it with tasks and just fire 4 or 5 tasks in the same batch, but:
How could I do this with tasks and still have a waiting point before the start of the tasks that blocks until there is an item to be consumed in the BC (I don't want to start processing if there is nothing. Once there is something in the BC i would just start the batch of 4 tasks, and use a TryTake inside each one so if there is nothing to take they don't block, because I don't know if I can always reach the number of items from the BC as the batch of tasks, for example, just one item left in the BC and a batch of 4 tasks) ?
How could I do this and take advantage of the efficiency that Parallel.For offers?
How could I save the results of the tasks in the same FIFO order in which the items were extracted from the BC?
Is there any other concurrency class more suited to this kind of hybrid processing of items in the consumer?
Also, this is my first question ever made in StackOverflow, so if you need any more data or you just think that my question is not correct just let me know.
I think I follow what you're asking. Why not create a ConcurrentBag and add to it while processing, like this:
while (!_cancelTasks)
{
//BlockingCollections with Parallel ForEach
var bc = _concurrentCollection;
var q = new ConcurrentBag<ScannedPage>();
Parallel.ForEach(bc.GetConsumingEnumerable(), item =>
{
ScannedPage currentPage = item;
q.Add(item);
// process a batch of images from the bc and check if an image has a valid barcode. T
});
//Here should go the code that takes the results from each tasks, process them in the same FIFO order in which they entered the BC and save each image to a file, all of this in this same thread.
//process items in your list here by sorting using some sequence key
var items = q.OrderBy( o=> o.SeqNbr).ToList();
foreach( var item in items){
...
}
}
This obviously doesn't enqueue them in the exact order they were added to the BC but you could add some sequence nbr to the ScannedPage object like Alex suggested and then sort the results after.
Here's how I'd handle the sequence:
Add this to the ScannedPage class:
public static int _counter; //public because this is just an example but it would work.
Get a sequence nbr and assign here:
private void Current_OnPageScanned(object sender, ScannedPage scannedPage)
{
    lock (this) { // to single-thread this process; not necessary if it's already single-threaded, of course
        // Use the value returned by Interlocked.Increment rather than re-reading the shared counter.
        scannedPage.SeqNbr = System.Threading.Interlocked.Increment(ref ScannedPage._counter);
        ...
    }
}
Whenever you need the results of a parallel operation, using PLINQ is generally more convenient than using the Parallel class. Here is how you could refactor your code using PLINQ:
private void Init()
{
_cancelTasks = new CancellationTokenSource();
_checkTask = Task.Run(() =>
{
while (true)
{
_cancelTasks.Token.ThrowIfCancellationRequested();
var bc = _concurrentCollection;
var partitioner = Partitioner.Create(
bc.GetConsumingEnumerable(_cancelTasks.Token),
EnumerablePartitionerOptions.NoBuffering);
ScannedPage[] results = partitioner
.AsParallel()
.AsOrdered()
.Select(scannedPage =>
{
// Process the scannedPage
return scannedPage;
})
.ToArray();
// Process the results
}
});
}
The .AsOrdered() is what ensures that you'll get the results in the same order as the input.
Be aware that when you consume a BlockingCollection<T> with the Parallel class or PLINQ, it is important to use the Partitioner with the EnumerablePartitionerOptions.NoBuffering configuration, otherwise there is a risk of deadlocks. The default greedy behavior of Parallel/PLINQ and the blocking behavior of BlockingCollection<T> do not interact well.
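Here is a small self-contained sketch of the same combination that you can run to convince yourself the order is preserved. OrderedPipeline and SquareInOrder are names invented for the example, and squaring a number stands in for the real per-page work.

```csharp
using System.Collections.Concurrent;
using System.Linq;

public static class OrderedPipeline
{
    // Squares every queued number in parallel and returns the results in queue order.
    public static int[] SquareInOrder(BlockingCollection<int> source)
    {
        return Partitioner
            .Create(source.GetConsumingEnumerable(), EnumerablePartitionerOptions.NoBuffering)
            .AsParallel()
            .AsOrdered()
            .Select(n => n * n)
            .ToArray();
    }
}
```

As with the code above, ToArray() only returns once CompleteAdding() has been called on the collection, so in a never-ending producer scenario you would complete and replace the collection per batch.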