I used to have:
using (MyWebClient client = new MyWebClient(TimeoutInSeconds))
{
    var res = client.DownloadData(par.Base_url);
    //code that checks res
}
Now I have:
using (MyWebClient client = new MyWebClient(TimeoutInSeconds))
{
    client.DownloadDataCompleted += (sender, e) =>
    {
        //code that checks e.Result
    };
    client.DownloadDataAsync(new Uri(par.Base_url));
}
Where MyWebClient is derived from WebClient.
Now I have lots of threads doing this. In the first case memory consumption wasn't an issue, but in the second I see a steady rise in memory until I get an OutOfMemoryException.
I profiled, and it seems that WebClient is the culprit: it isn't being disposed, and the downloaded data is kept alive. But why? What's the difference between the two cases? Perhaps e.Result needs to be disposed of somehow?
Your first case limits the number of concurrent downloads to the number of threads. Your second case has no limit on the number of concurrent downloads.
You are disposing of your WebClient immediately in the second option. You have a couple of choices:
If you're using .NET 4.5 (or .NET 4.0 with Visual Studio 2012 and the Async Targeting Pack installed), you can write var res = await client.DownloadDataTaskAsync(par.Base_url); and have code that looks similar to your first version but is actually asynchronous.
Use a normal continuation and get rid of your using block.
The first option would look like this:
using (MyWebClient client = new MyWebClient(TimeoutInSeconds))
{
    var res = await client.DownloadDataTaskAsync(par.Base_url);
    //code that checks res
}
The second option would look like this:
var client = new MyWebClient(TimeoutInSeconds);
client.DownloadDataTaskAsync(new Uri(par.Base_url))
    .ContinueWith(t =>
    {
        client.Dispose();
        var res = t.Result;
        //code that checks res
    });
HOWEVER
You must change your threading approach depending on which solution you use. The first version of your code runs synchronously, so if you have a thread dedicated to each URL (or connection, or however you're splitting them up), the download runs synchronously on that thread and blocks it. If you choose either of the options above, however, the work completes on I/O completion threads, splitting it out from the main thread. In the long run this is probably better, but it means you have to be mindful of how many of these requests you submit in parallel.
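For illustration, here is a minimal sketch of one way to cap the number of in-flight downloads with a SemaphoreSlim. The urls collection stands in for the per-thread par.Base_url values, the limit of 10 is arbitrary, and MyWebClient is the type from the question:
// Allow at most 10 downloads in flight at any time (arbitrary limit).
var throttle = new SemaphoreSlim(10);
var tasks = urls.Select(async url =>
{
    await throttle.WaitAsync();
    try
    {
        using (var client = new MyWebClient(TimeoutInSeconds))
        {
            var res = await client.DownloadDataTaskAsync(new Uri(url));
            //code that checks res
        }
    }
    finally
    {
        throttle.Release();
    }
});
await Task.WhenAll(tasks);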
Related
I've got a problem where I have to process a large batch of large jsonl files (read, deserialize, do some transforms, db lookups etc.), then write the transformed results, in a .NET Core console app.
I've gotten better throughput by putting the output in batches on a separate thread, and I was trying to improve the processing side by adding some parallelism, but the overhead ended up being self-defeating.
I had been doing:
using (var stream = new FileStream(_filePath, FileMode.Open))
using (var reader = new StreamReader(stream))
{
    for (;;)
    {
        var l = reader.ReadLine();
        if (l == null)
            break;
        // Deserialize
        // Do some database lookups
        // Do some transforms
        // Pass result to output thread
    }
}
And some diagnostic timings showed me that the ReadLine() call was taking more time than the deserialization, etc. To put some numbers on that, a large file would have about:
11 seconds spent on ReadLine
7.8 seconds spent on deserialization
10 seconds spent on db lookups
I wanted to overlap that 11 seconds of file I/O with the other work, so I tried:
using (var stream = new FileStream(_filePath, FileMode.Open))
using (var reader = new StreamReader(stream))
{
    var nextLine = reader.ReadLineAsync();
    for (;;)
    {
        var l = nextLine.Result;
        if (l == null)
            break;
        nextLine = reader.ReadLineAsync();
        // Deserialize
        // Do some database lookups
        // Do some transforms
        // Pass result to output thread
    }
}
The idea was to get the next I/O going while I did the transform work. Only that ended up taking a lot longer than the regular sync version (like twice as long).
I've got requirements for predictability of the overall result (i.e. the same set of files has to be processed in name order, and the output rows have to be in the same, predictable order), so I can't just throw a file per thread and let them fight it out.
I was just trying to introduce enough parallelism to smooth the throughput over a large set of inputs, and I was surprised how counterproductive the above turned out to be.
Am I missing something here?
The built-in asynchronous filesystem APIs are currently broken, and you are advised to avoid them. Not only are they much slower than their synchronous counterparts, but they are not even truly asynchronous. .NET 6 will come with an improved FileStream implementation, so in a few months this may no longer be an issue.
What you are trying to achieve is called task parallelism, where two or more heterogeneous operations run concurrently and independently of each other. It's an advanced technique, and it requires specialized tools. The most common type of parallelism is the so-called data parallelism, where the same operation runs in parallel over a list of homogeneous data; it's commonly implemented using the Parallel class or the PLINQ library.
To achieve task parallelism, the most readily available tool is the TPL Dataflow library, which is built into the .NET Core / .NET 5 platforms; you only need to install a package if you are targeting the .NET Framework. This library allows you to create a pipeline of linked components called "blocks" (TransformBlock, ActionBlock, BatchBlock etc.), where each block acts as an independent processor with its own input and output queues. You feed the pipeline with data, and the data flows from block to block, being processed along the way. You Complete the first block in the pipeline to signal that no more input data will ever be available, and then await the Completion of the last block to make your code wait until all the work has been done. Here is an example:
private async void Button1_Click(object sender, EventArgs e)
{
    Button1.Enabled = false;
    var fileBlock = new TransformManyBlock<string, IList<string>>(filePath =>
    {
        return File.ReadLines(filePath).Buffer(10);
    });
    var deserializeBlock = new TransformBlock<IList<string>, MyObject[]>(lines =>
    {
        return lines.Select(line => Deserialize(line)).ToArray();
    }, new ExecutionDataflowBlockOptions()
    {
        MaxDegreeOfParallelism = 2 // Let's assume that Deserialize is parallelizable
    });
    var persistBlock = new TransformBlock<MyObject[], MyObject[]>(async objects =>
    {
        foreach (MyObject obj in objects) await PersistToDbAsync(obj);
        return objects;
    });
    var displayBlock = new ActionBlock<MyObject[]>(objects =>
    {
        foreach (MyObject obj in objects) TextBox1.AppendText($"{obj}\r\n");
    }, new ExecutionDataflowBlockOptions()
    {
        TaskScheduler = TaskScheduler.FromCurrentSynchronizationContext()
        // Make sure that the delegate will be invoked on the UI thread
    });
    fileBlock.LinkTo(deserializeBlock,
        new DataflowLinkOptions { PropagateCompletion = true });
    deserializeBlock.LinkTo(persistBlock,
        new DataflowLinkOptions { PropagateCompletion = true });
    persistBlock.LinkTo(displayBlock,
        new DataflowLinkOptions { PropagateCompletion = true });
    foreach (var filePath in Directory.GetFiles(@"C:\Data"))
        await fileBlock.SendAsync(filePath);
    fileBlock.Complete();
    await displayBlock.Completion;
    MessageBox.Show("Done");
    Button1.Enabled = true;
}
The data passed through the pipeline should be chunky. If each unit of work is too lightweight, you should batch the units into arrays or lists; otherwise the overhead of moving lots of tiny pieces of data around will outweigh the benefits of parallelism. That's the reason for using the Buffer LINQ operator (from the System.Interactive package) in the example above. .NET 6 will come with a new Chunk LINQ operator, offering the same functionality.
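For comparison, here is what that batching step could look like on .NET 6, using the built-in Chunk operator in place of System.Interactive's Buffer (a sketch, reusing the filePath variable from the example above):
// Chunk yields the lines in batches of up to 10, as string arrays.
IEnumerable<string[]> batches = File.ReadLines(filePath).Chunk(10);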
Theodor's suggestion looks like a really powerful and useful library that's worth checking out, but if you're looking for a smaller DIY solution this is how I would approach it:
using System;
using System.IO;
using System.Threading.Tasks;
using System.Collections.Generic;

namespace Parallelism
{
    class Program
    {
        private static Queue<string> _queue = new Queue<string>();
        private static Task _lastProcessTask;

        static async Task Main(string[] args)
        {
            string path = "???";
            await ReadAndProcessAsync(path);
        }

        private static async Task ReadAndProcessAsync(string path)
        {
            using (var str = File.OpenRead(path))
            using (var sr = new StreamReader(str))
            {
                string line = null;
                while (true)
                {
                    line = await sr.ReadLineAsync();
                    if (line == null)
                        break;
                    lock (_queue)
                    {
                        _queue.Enqueue(line);
                        if (_queue.Count == 1)
                            // There was nothing in the queue before,
                            // so initiate a new processing loop. Save
                            // but DON'T await the Task yet.
                            _lastProcessTask = ProcessQueueAsync();
                    }
                }
            }
            // Now that file reading is completed, await
            // _lastProcessTask to ensure we don't return
            // before it's finished. (It is still null if
            // the file was empty.)
            if (_lastProcessTask != null)
                await _lastProcessTask;
        }

        // This will continue processing as long as lines are in the queue,
        // including new lines entering the queue while processing earlier ones.
        private static Task ProcessQueueAsync()
        {
            return Task.Run(async () =>
            {
                while (true)
                {
                    string line;
                    lock (_queue)
                    {
                        // Only peek at first so the read loop doesn't think
                        // the queue is empty and initiate a second processing
                        // loop while we're processing this line.
                        if (!_queue.TryPeek(out line))
                            return;
                    }
                    await ProcessLineAsync(line);
                    lock (_queue)
                    {
                        // Dequeue the item we just processed. If it's the last
                        // one, this loop is done.
                        _queue.Dequeue();
                        if (_queue.Count == 0)
                            return;
                    }
                }
            });
        }

        private static async Task ProcessLineAsync(string line)
        {
            // do something
            await Task.CompletedTask; // placeholder so the async method compiles cleanly
        }
    }
}
Note this approach has a processing loop that terminates when nothing is left in the queue and is re-initiated as needed when new items are ready. Another approach would be a continuous processing loop that repeatedly re-checks the queue and does a Task.Delay() for a small amount of time while it is empty (sketched below). I like my approach better because it doesn't bog down a worker thread with periodic and unnecessary checks, but the performance difference would likely be unnoticeable.
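Purely for illustration, a minimal sketch of that polling alternative, reusing the _queue and ProcessLineAsync names from the code above; the 50 ms delay is an arbitrary choice:
// Runs for the lifetime of the app; add a CancellationToken to stop it.
private static async Task PollingProcessLoopAsync()
{
    while (true)
    {
        string line = null;
        lock (_queue)
        {
            if (_queue.Count > 0)
                line = _queue.Dequeue();
        }
        if (line != null)
            await ProcessLineAsync(line);
        else
            await Task.Delay(50); // idle back-off while the queue is empty
    }
}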
Also, just to comment on Blindy's answer: I have to disagree with discouraging the use of parallelism here. First off, most CPUs these days are multi-core, so smart use of the .NET thread pool will in fact maximize your application's efficiency on multi-core CPUs, with pretty minimal downside in single-core scenarios.
More importantly, though, async does not equal multithreading. Asynchronous programming existed long before multithreading, with I/O being the most notable example. I/O operations are in large part handled by hardware other than the CPU (the NIC, SATA controllers, etc.). They use an ancient concept called the hardware interrupt, which most coders today have probably never heard of and which predates multithreading by decades. It's basically just a way to give the CPU a callback to execute when an off-CPU operation is finished. So when you use a well-behaved asynchronous API (notwithstanding that .NET FileStream has issues, as Theodor mentioned), your CPU really shouldn't be doing much work at all. And when you await such an API, the CPU is basically sitting idle until the other hardware in the machine has written the requested data to RAM.
I agree with Blindy that it would be better if computer science programs did a better job of teaching people how computer hardware actually works. Looking to take advantage of the fact that the CPU can be doing other things while waiting for data to be read off the disk, off a network, etc., is, in the words of Captain Kirk, "officer thinking".
11 seconds spent on ReadLine
More precisely, that's 11 seconds spent on file I/O, but you didn't measure it.
Replace your stream creation with this instead:
using var reader = new StreamReader(_filePath, Encoding.UTF8, false, 50 * 1024 * 1024);
That will cause it to read into a 50 MB buffer (play with the size as needed) to avoid repeated I/O on what seems like an ancient hard drive.
I was just trying to introduce enough parallelism to smooth the throughput
Not only did you not introduce any parallelism at all, but you also used ReadLineAsync wrong: it returns a Task<string>, not a string.
It's complete overkill anyway; the buffer size increase will most likely fix your issue. But if you actually want to do this, you need two threads that communicate over a shared data structure, as Peter said.
Only that ended up taking a lot longer than the regular sync stuff
It baffles me that people think multi-threaded code should take less processing power than single-threaded code. There must be some really basic understanding missing from present-day education to lead to this. Multi-threading involves multiple extra context switches, mutex contention, your OS scheduler kicking in to replace one of your threads (leading to starvation or oversaturation), and gathering, serializing and aggregating results after the work is done. None of that is free or easy to implement.
I have a task where I form thousands of requests which are later sent to a server. The server returns the response for each request and that response is then dumped to an output file line by line.
The pseudo code goes like this:
//requests contains thousands of requests to be sent to the server
string[] requests = GetRequestsString();
foreach(string request in requests)
{
    string response = MakeWebRequest(request);
    ParseandDump(response);
}
Now, as can be seen, the server is handling my requests one by one. I want to make this entire process faster. The server in question is capable of handling multiple requests at a time, so I want to apply multi-threading and send, let's say, four requests to the server at a time, dumping the responses on the same thread.
Can you please give me any pointers to possible approaches?
You can take advantage of Task from .NET 4.0 and the new HttpClient. The sample code below shows how to send requests in parallel and then dump each response on the same thread by using ContinueWith:
var httpClient = new HttpClient();
var tasks = requests.Select(r => httpClient.GetStringAsync(r).ContinueWith(t =>
{
    ParseandDump(t.Result);
})).ToArray();
Task.WaitAll(tasks); // block until every response has been parsed and dumped
Task uses the ThreadPool under the hood, so you don't need to specify how many threads should be used; the ThreadPool will manage this for you in an optimized way.
The easiest way would be to use Parallel.ForEach like this:
string[] requests = GetRequestsString();
Parallel.ForEach(requests, request => ParseandDump(MakeWebRequest(request)));
.NET Framework 4.0 or greater is required to use Parallel.
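The question asks for roughly four requests at a time; as a sketch, the degree of parallelism can be capped with ParallelOptions, reusing the question's MakeWebRequest and ParseandDump placeholders:
// Allow at most four requests to be in flight at once.
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.ForEach(requests, options,
    request => ParseandDump(MakeWebRequest(request)));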
I think this could be done with the producer-consumer pattern. You could use a ConcurrentQueue (from the System.Collections.Concurrent namespace) as a shared resource between the many parallel web requests and the dumping thread.
The pseudo code would be something like:
var requests = GetRequestsString();
var queue = new ConcurrentQueue<string>();

Task.Factory.StartNew(() =>
{
    Parallel.ForEach(requests, currentRequest =>
    {
        queue.Enqueue(MakeWebRequest(currentRequest));
    });
});

Task.Factory.StartNew(() =>
{
    while (true)
    {
        string response;
        if (queue.TryDequeue(out response))
        {
            ParseandDump(response);
        }
    }
});
Maybe a BlockingCollection would serve you even better, depending on how you want to synchronize the threads and signal the end of the incoming requests; a sketch follows.
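For illustration, a minimal BlockingCollection version under the same assumptions (GetRequestsString, MakeWebRequest and ParseandDump are the question's placeholders); CompleteAdding is what tells the consumer that no more responses are coming:
var requests = GetRequestsString();
var queue = new BlockingCollection<string>();

var producer = Task.Factory.StartNew(() =>
{
    Parallel.ForEach(requests, currentRequest =>
        queue.Add(MakeWebRequest(currentRequest)));
    queue.CompleteAdding(); // signal: no more responses will be added
});

var consumer = Task.Factory.StartNew(() =>
{
    // GetConsumingEnumerable blocks while the collection is empty
    // and finishes once CompleteAdding has been called.
    foreach (string response in queue.GetConsumingEnumerable())
        ParseandDump(response);
});

Task.WaitAll(producer, consumer);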
I am using System.Net.Http to access network resources. When running on a single thread it works perfectly. When I run the code via the TPL, it hangs and never completes until the timeout is hit.
What happens is that all the threads end up waiting on the sendTask.Result line. I am not sure what they are waiting on, but I assume it is something in HttpClient.
The networking code is:
using (var request = new HttpRequestMessage(HttpMethod.Get, "http://google.com/"))
{
    using (var client = new HttpClient())
    {
        var sendTask = client.SendAsync
            (request, HttpCompletionOption.ResponseHeadersRead);
        using (var response = sendTask.Result)
        {
            var streamTask = response.Content.ReadAsStreamAsync();
            using (var stream = streamTask.Result)
            {
                // problem occurs in line above
            }
        }
    }
}
The TPL code that I am using is as follows. The _Do method contains exactly the code above.
var taskEnumerables = Enumerable.Range(0, 100);
var tasks = taskEnumerables
    .Select(x => Task.Factory.StartNew(() => _Do(ref count)))
    .ToArray();
Task.WaitAll(tasks);
I have tried a couple of different schedulers, and the only way that I can get it to work is to write a scheduler that limits the number of running tasks to 2 or 3. However, even this fails sometimes.
I would assume that my problem is in HttpClient, but for the life of me I can't see any shared state in my code. Does anyone have any ideas?
Thanks,
Erick
I finally found the issue. The problem was that HttpClient issues its own additional tasks, so a single task that I start might actually end up spawning five or more tasks.
The scheduler was configured with a limit on the number of running tasks. I started my tasks, which caused the number of running tasks to hit that limit. HttpClient then attempted to start its own tasks, but because the limit had been reached, it blocked until the number of running tasks went down, which of course never happened, as they were all waiting for my tasks to finish. Hello, deadlock.
The morals of the story:
Tasks might be a global resource
There are often non-obvious interdependencies between tasks
Schedulers are not easy to work with
Don't assume that you control either schedulers or number of tasks
I ended up using another method to throttle the number of connections.
Erick
I'm working with LogParser 2.2 in C# via COM interop and would like to be able to give long-running queries a timeout.
e.g.
var ctx = new COMIISW3CInputContextClassClass();
var log = new LogQueryClassClass();
var rs = log.Execute(qry, ctx);
Is it possible to interrupt the log.Execute call if it takes too long?
I have tried Thread.Abort(), but it appears the ThreadAbortException waits until the Execute call finishes normally.
The code used to test Thread.Abort() is:
var ctx = new COMIISW3CInputContextClassClass();
var log = new LogQueryClassClass();
ILogRecordset rs = null;

var t = new Thread(() =>
{
    rs = log.Execute(qry, ctx);
});
t.SetApartmentState(ApartmentState.STA);
t.Start();
t.Join(100);
t.Abort();

// this tests if the file lock is still held by log parser
Assert.Throws<IOException>(() =>
    File.OpenWrite(path));

t.Join(10000);

// file is no longer locked
using (File.OpenWrite(path))
    Assert.IsTrue(true);
Cancelling execution with Thread.Abort is rarely safe, and it is not safe in this case, as there is no way to know the state of Log Parser's data structures. You could perform the query in a separate process that you communicate with using standard input/output, named pipes, WCF, or your favorite IPC technology. Then, to cancel the query, use Process.Kill on that process.
NOTE: I don't know whether Log Parser writes any temp files that this would leave behind. Other than that, Windows should clean up all the process state correctly for you.
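As a rough illustration of the separate-process approach, assuming a hypothetical LogParserWorker.exe console app that runs the query and writes the results to standard output; the 30-second timeout is arbitrary:
// LogParserWorker.exe is a hypothetical helper, not part of Log Parser.
var psi = new ProcessStartInfo("LogParserWorker.exe", "\"" + qry + "\"")
{
    RedirectStandardOutput = true,
    UseShellExecute = false
};
using (var process = Process.Start(psi))
{
    // Drain stdout on a background task so a full pipe buffer
    // can't block the child process.
    var outputTask = process.StandardOutput.ReadToEndAsync();
    if (!process.WaitForExit(30000)) // timeout in milliseconds
    {
        process.Kill(); // hard-cancel the long-running query
        throw new TimeoutException("Log Parser query timed out.");
    }
    string results = outputTask.Result;
}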
See cancellation tokens: http://msdn.microsoft.com/en-us/library/dd997289.aspx
I've got a list of objects that I wish to copy from one source to another. It was suggested that I could speed things up by using Parallel.ForEach.
How can I refactor the following pseudo code to leverage Parallel.ForEach(..)?
var foos = GetFoos().ToList();

foreach(var foo in foos)
{
    CopyObjectFromOldBucketToNewBucket(foo, oldBucket, newBucket,
        accessKeyId, secretAccessKey);
}
CopyObjectFromOldBucketToNewBucket uses the Amazon REST APIs to move items from one bucket to another.
Cheers :)
Parallel is actually not the best option here. Parallel will run your code in parallel, but it will still use up a thread-pool thread for each request to AWS. It would be a far better use of resources to use the BeginCopyObject method instead. That will not use up a thread-pool thread waiting on a response; it will only use one when the response is received and needs to be processed.
Here's a simplified example of how to use Begin/End methods. These are not specific to AWS but are a pattern found throughout the .NET BCL.
public static void CopyFoos()
{
    var client = new AmazonS3Client(...);
    var foos = GetFoos().ToList();
    var asyncs = new List<IAsyncResult>();

    foreach(var foo in foos)
    {
        var request = new CopyObjectRequest { ... };
        asyncs.Add(client.BeginCopyObject(request, EndCopy, client));
    }

    foreach(IAsyncResult ar in asyncs)
    {
        if (!ar.IsCompleted)
        {
            ar.AsyncWaitHandle.WaitOne();
        }
    }
}

private static void EndCopy(IAsyncResult ar)
{
    ((AmazonS3Client)ar.AsyncState).EndCopyObject(ar);
}
For production code you may want to keep track of how many requests you've dispatched and only send out a limited number at any one time. Testing or AWS docs may tell you how many concurrent requests are optimal.
In this case we don't actually need to do anything when the requests are completed so you may be tempted to skip the EndCopy calls but that would cause a resource leak. Whenever you call BeginXxx you must call the corresponding EndXxx method.
Since your code doesn't have any dependencies other than foos, you can simply do:
Parallel.ForEach(foos, foo =>
{
    CopyObjectFromOldBucketToNewBucket(foo, oldBucket, newBucket,
        accessKeyId, secretAccessKey);
});
Keep in mind, though, that I/O can only be parallelized to a certain degree; beyond that, performance might actually degrade.