Multiple consumers update result array inconsistently - c#

I have a large file in which each row can be processed separately, so I launch one reader and multiple parsers.
Each parser writes its result back to a result holder array for further processing.
I found that if I launch more than one parser, the result holder array has different content on each run, no matter whether I use ConcurrentQueue, BlockingCollection, or something else.
I ran the program repeatedly and printed the result array each time; every run gives a different result when I use more than one parser.
static string[] result = new string[nRow];
static BlockingCollection<queueItem> myBlk = new BlockingCollection<queueItem>();
static void Main()
{
Reader();
}
static void parserThread()
{
while (myBlk.IsCompleted == false)
{
queueItem one;
if (myBlk.TryTake(out one) == false)
{
System.Threading.Thread.Sleep(tSleep);
}
else
{
oneDataRow(one.seqIndex, one.line);
}
}
}
static void oneDataRow(int rowIndex, string line)
{
result[rowIndex] = // some process with line
}
static void Reader()
{
for (int i = 0; i < 10; i++)
{
Task t = new Task(() => parserThread());
t.Start();
}
StreamReader sr = new StreamReader(path);
string line;
int nRead=0;
while((line = sr.ReadLine()) != null)
{
string innerLine = line;
int innerN = nRead;
myBlk.Add(new queueItem(innerN, innerLine));
nRead++;
}
myBlk.CompleteAdding();
sr.Close();
while (myBlk.IsCompleted == false)
{
System.Threading.Thread.Sleep(tSleep);
}
}
class queueItem
{
public int seqIndex = 0;
public string line = "";
public queueItem(int RowOrder, string content)
{
seqIndex = RowOrder;
line = content;
}
}

The way you are waiting for the process to complete is problematic:
while (myBlk.IsCompleted == false)
{
System.Threading.Thread.Sleep(tSleep);
}
Here is the description of the IsCompleted property:
Gets whether this BlockingCollection<T> has been marked as complete for adding and is empty.
In your case the completion of the BlockingCollection should not signal the completion of the whole operation, because the last lines taken from the collection may not be processed yet.
Instead, you should store the worker tasks in an array (or list) and wait for them to complete.
Task.WaitAll(tasks);
In general you should rarely use the IsCompleted property for anything other than for logging debug information. Using it for controlling the execution flow introduces race conditions in most cases.
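A minimal sketch of that approach, assuming a hypothetical ParseAll helper and an uppercase conversion standing in for the real per-line processing. Each worker drains the collection with GetConsumingEnumerable(), which blocks while the collection is empty and ends once CompleteAdding() has been called and every item has been taken; the producer then waits on the tasks themselves rather than on IsCompleted.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParserDemo
{
    // Hypothetical helper: parses every line on `workerCount` consumer tasks
    // and returns the results in input order.
    public static string[] ParseAll(IList<string> lines, int workerCount)
    {
        var result = new string[lines.Count];
        using (var queue = new BlockingCollection<KeyValuePair<int, string>>())
        {
            var tasks = new Task[workerCount];
            for (int i = 0; i < workerCount; i++)
            {
                tasks[i] = Task.Run(() =>
                {
                    // Blocks while the queue is empty; the loop ends only after
                    // CompleteAdding() was called and every item was taken.
                    foreach (var item in queue.GetConsumingEnumerable())
                    {
                        // Placeholder for the real parsing work.
                        result[item.Key] = item.Value.ToUpperInvariant();
                    }
                });
            }

            for (int i = 0; i < lines.Count; i++)
            {
                queue.Add(new KeyValuePair<int, string>(i, lines[i]));
            }
            queue.CompleteAdding();

            // Wait for the workers, not for the queue: a worker may still be
            // processing the last item it took after the queue is empty.
            Task.WaitAll(tasks);
        }
        return result;
    }

    public static void Main()
    {
        string[] parsed = ParseAll(new[] { "a", "b", "c" }, 4);
        Console.WriteLine(string.Join(",", parsed));
    }
}
```

Because each row writes only to its own array slot, the array itself needs no lock; the only thing that was missing is the wait for the workers.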

Related

How to avoid collection modification during JSON serialization in looped multithreaded task?

I have a problem when serializing to a JSON file using Newtonsoft.Json.
In a loop I am firing tasks on various threads:
List<Task> jockeysTasks = new List<Task>();
for (int i = 1; i < 1100; i++)
{
int j = i;
Task task = Task.Run(async () =>
{
LoadedJockey jockey = new LoadedJockey();
jockey = await Task.Run(() => _scrapServices.ScrapSingleJockeyPL(j));
if (jockey.Name != null)
{
_allJockeys.Add(jockey);
}
UpdateStatusBar = j * 100 / 1100;
if (j % 100 == 0)
{
await Task.Run(() => _dataServices.SaveAllJockeys(_allJockeys)); //saves everything to JSON file
}
});
jockeysTasks.Add(task);
}
await Task.WhenAll(jockeysTasks);
And if (j % 100 == 0), it tries to save the collection _allJockeys to the file (I will add a counter to make this more reliable, but that is not the point):
public void SaveAllJockeys(List<LoadedJockey> allJockeys)
{
if (allJockeys.Count != 0)
{
if (File.Exists(_jockeysFileName)) File.Delete(_jockeysFileName);
try
{
using (StreamWriter file = File.CreateText(_jockeysFileName))
{
JsonSerializer serializer = new JsonSerializer();
serializer.Serialize(file, allJockeys);
}
}
catch (Exception e)
{
dialog.ShowDialog("Could not save the results, " + e.ToString(), "Error");
}
}
}
During that time, I believe, other tasks are adding new items to the collection, which throws this exception:
Collection was modified; enumeration operation may not execute.
As I read in an article, you can change the type of iteration to avoid the exception. As far as I know, though, I cannot change how the Newtonsoft.Json package iterates.
Thank you in advance for any tips on how to avoid the exception and save the collection without unexpected changes.
You should probably inherit from List and use a ReaderWriterLock (https://learn.microsoft.com/en-us/dotnet/api/system.threading.readerwriterlock?view=netframework-4.8)
i.e. (not tested pseudo C#)
public class MyJockeys: List<LoadedJockey>
{
System.Threading.ReaderWriterLock _rw_lock = new System.Threading.ReaderWriterLock();
public new void Add(LoadedJockey j)
{
// Acquire outside the try block: if the acquire times out, there is no lock to release
_rw_lock.AcquireWriterLock(5000); // or whatever you deem an acceptable timeout
try
{
base.Add(j);
}
finally
{
_rw_lock.ReleaseWriterLock();
}
}
public string ToJSON()
{
_rw_lock.AcquireReaderLock(5000); // or whatever you deem an acceptable timeout
try
{
string s = ""; // Serialize here using Newtonsoft
return s;
}
finally
{
_rw_lock.ReleaseReaderLock();
}
}
// And override Remove and anything else you need
}
Get the idea?
Hope this helps.
Regards,
Adam.
I tried using ToList() on the collection, which creates a copy of the list, and that had a positive effect.
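One caveat: ToList() reads the source list while it copies, so it can still race with concurrent Add calls, and List<T>.Add itself is not safe to call from several threads at once. A minimal sketch of one way to guard both, assuming a hypothetical SnapshotList<T> wrapper (it is not part of Newtonsoft.Json or of the original code): every Add and every copy happens under the same lock, and the serializer would only ever see the private copy.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical wrapper, for illustration only.
public sealed class SnapshotList<T>
{
    private readonly List<T> _items = new List<T>();
    private readonly object _gate = new object();

    public void Add(T item)
    {
        lock (_gate) { _items.Add(item); }
    }

    // Returns a private copy that no other thread can modify.
    public List<T> Snapshot()
    {
        lock (_gate) { return new List<T>(_items); }
    }
}

public static class SnapshotDemo
{
    // Adds 0..count-1 from parallel workers and takes snapshots while they run.
    public static int Run(int count)
    {
        var list = new SnapshotList<int>();
        Parallel.For(0, count, i =>
        {
            list.Add(i);
            if (i % 100 == 0)
            {
                // Safe to enumerate (or hand to a serializer): it is a copy.
                long sum = 0;
                foreach (int value in list.Snapshot()) { sum += value; }
            }
        });
        return list.Snapshot().Count;
    }

    public static void Main()
    {
        Console.WriteLine(Run(1000));
    }
}
```

In the question's code, the equivalent would be to pass the snapshot, rather than _allJockeys itself, to SaveAllJockeys.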

How to prevent threads using the same variables

I have a multi-line textbox and I want to process each line using multiple threads.
The textbox could have a lot of lines (1000+), but not as many threads. I want to use a configurable number of threads to read all of those lines without any duplicates (that is, each thread reads unique lines only; if a line has already been read by another thread, it is not read again).
What I have right now:
private void button5_Click(object sender, EventArgs e)
{
for (int i = 0; i < threadCount; i++)
{
new Thread(new ThreadStart(threadJob)).Start();
}
}
private void threadJob()
{
for (int i = 0; i < txtSearchTerms.Lines.Length; i++)
{
lock (threadLock)
{
Console.WriteLine(txtSearchTerms.Lines[i]);
}
}
}
It does start the correct amount of threads, but they all read the same variable multiple times.
Separate the data collection, the data processing, and any steps that follow the calculation. You can safely collect results calculated in parallel in a ConcurrentBag<T>, which is a thread-safe collection.
Then you don't need to worry about locking, and each line will be processed only once.
1. Collect data
2. Execute collected data in parallel
3. Handle calculated result
private string Process(string line)
{
// Your logic for given line
}
private void Button_Click(object sender, EventArgs e)
{
var results = new ConcurrentBag<string>();
Parallel.ForEach(txtSearchTerms.Lines,
line =>
{
var result = Process(line);
results.Add(result);
});
foreach (var result in results)
{
Console.WriteLine(result);
}
}
By default, Parallel.ForEach will use as many threads as the underlying scheduler provides.
You can limit the number of threads by passing an instance of ParallelOptions to the Parallel.ForEach method.
var options = new ParallelOptions
{
MaxDegreeOfParallelism = Environment.ProcessorCount
};
var results = new ConcurrentBag<string>();
Parallel.ForEach(values,
options,
value =>
{
var result = Process(value);
results.Add(result);
});
Consider using Parallel.ForEach to iterate over the Lines array. It is just like a normal foreach loop (i.e. each value will be processed only once), but the work is done in parallel - with multiple Tasks (threads).
var data = txtSearchTerms.Lines;
var threadCount = 4; // or whatever you want
Parallel.ForEach(data,
new ParallelOptions() { MaxDegreeOfParallelism = threadCount },
(val) =>
{
//Your code here
Console.WriteLine(val);
});
The above code will need this line to be added at the top of your file:
using System.Threading.Tasks;
Alternatively if you want to not just execute something, but also return / project something then instead try:
var results = data.AsParallel()
.AsOrdered() // keep the results in input order
.WithDegreeOfParallelism(threadCount)
.Select(val =>
{
// Your code here, I just return the value but you could return whatever you want
return val;
}).ToList();
which still executes the code in parallel, but also returns a List (in this case with the same values as in the original TextBox). Because of AsOrdered(), the List will be in the same order as your input; without it, PLINQ does not guarantee the order. This needs using System.Linq; at the top of your file.
There are many ways to do what you want.
Take an extra class field:
private int _counter;
Use it instead of loop index. Increment it inside the lock:
private void threadJob()
{
while (true)
{
lock (threadLock)
{
if (_counter >= txtSearchTerms.Lines.Length)
return;
Console.WriteLine(txtSearchTerms.Lines[_counter]);
_counter++;
}
}
}
It works, but it is very inefficient.
Let's consider another way. Each thread will handle its part of the dataset independently of the others.
public void button5_Click(object sender, EventArgs e)
{
for (int i = 0; i < threadCount; i++)
{
new Thread(new ParameterizedThreadStart(threadJob)).Start(i);
}
}
private void threadJob(object o)
{
int threadNumber = (int)o;
int count = txtSearchTerms.Lines.Length / threadCount;
int start = threadNumber * count;
int end = threadNumber != threadCount - 1 ? start + count : txtSearchTerms.Lines.Length;
for (int i = start; i < end; i++)
{
Console.WriteLine(txtSearchTerms.Lines[i]);
}
}
This is more efficient because threads do not wait on the lock. However, the array elements are not processed in sequential order across the whole array.
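A possible middle ground between the locked counter and the fixed ranges, sketched here under the assumption that the lines have been copied into a plain string[] beforehand (ProcessAll is a hypothetical helper, and taking the string length is a placeholder for the real work): each thread claims the next unprocessed index with Interlocked.Increment, so no lock is needed, no line is handled twice, and faster threads simply take more lines.

```csharp
using System;
using System.Threading;

public static class ClaimDemo
{
    // Hypothetical helper: every element is processed exactly once.
    public static int[] ProcessAll(string[] lines, int threadCount)
    {
        var lengths = new int[lines.Length];
        int next = -1; // index most recently claimed by any thread

        var threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            threads[t] = new Thread(() =>
            {
                while (true)
                {
                    // Atomically claim the next index; no two threads get the same one.
                    int i = Interlocked.Increment(ref next);
                    if (i >= lines.Length) return;
                    lengths[i] = lines[i].Length; // placeholder for the real work
                }
            });
            threads[t].Start();
        }

        // Wait until every worker has run out of indices.
        foreach (Thread thread in threads) thread.Join();
        return lengths;
    }

    public static void Main()
    {
        int[] lengths = ProcessAll(new[] { "a", "bb", "ccc" }, 4);
        Console.WriteLine(string.Join(",", lengths));
    }
}
```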

Using threads to parse multiple Html pages faster

Here's what I'm trying to do:
Get one html page from url which contains multiple links inside
Visit each link
Extract some data from visited link and create object using it
So far, all I have is the simple and slow way:
public List<Link> searchLinks(string name)
{
List<Link> foundLinks = new List<Link>();
// getHtmlDocument() just returns HtmlDocument using input url.
HtmlDocument doc = getHtmlDocument(AU_SEARCH_URL + fixSpaces(name));
var link_list = doc.DocumentNode.SelectNodes(@"/html/body/div[@id='parent-container']/div[@id='main-content']/ol[@id='searchresult']/li/h2/a");
foreach (var link in link_list)
{
// TODO Threads
// getObject() creates object using data gathered
foundLinks.Add(getObject(link.InnerText, link.Attributes["href"].Value, getLatestEpisode(link.Attributes["href"].Value)));
}
return foundLinks;
}
To make it faster and more efficient I need to use threads, but I'm not sure how to approach it. I can't just start threads at random; I need to wait for them to finish. thread.Join() solves the "wait for threads to finish" problem, but I think it stops being fast, because each thread would be launched only after the previous one has finished.
The simplest way to offload the work to multiple threads would be to use Parallel.ForEach() in place of your current loop. Because List<T> is not thread-safe, the call to Add needs a lock (or a concurrent collection). Something like this:
Parallel.ForEach(link_list, link =>
{
// Do the slow work outside the lock so that iterations actually overlap.
var item = getObject(link.InnerText, link.Attributes["href"].Value, getLatestEpisode(link.Attributes["href"].Value));
lock (foundLinks)
{
foundLinks.Add(item);
}
});
I'm not sure if there are other threading concerns in your overall code. (Note, for example, that this no longer guarantees that the data is added to foundLinks in the original order.) But as long as nothing explicitly prevents concurrent work from taking place, this takes advantage of threading over multiple CPU cores to process the work.
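If the original order matters, one option is to write each result into its own slot of a pre-sized array instead of adding to a shared list. A sketch, assuming a hypothetical MapInOrder helper whose fetch delegate stands in for the getObject/getLatestEpisode calls; the inputs in Main are made up:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class OrderedParallelDemo
{
    // Hypothetical helper: applies `fetch` to each input in parallel and
    // returns the results in the same order as the inputs.
    public static TOut[] MapInOrder<TIn, TOut>(IList<TIn> inputs, Func<TIn, TOut> fetch)
    {
        var results = new TOut[inputs.Count];
        Parallel.For(0, inputs.Count, i =>
        {
            // Each iteration writes only its own slot, so no lock is needed.
            results[i] = fetch(inputs[i]);
        });
        return results;
    }

    public static void Main()
    {
        string[] urls = { "/a", "/b", "/c" }; // made-up inputs
        string[] pages = MapInOrder(urls, url => "page for " + url);
        Console.WriteLine(string.Join("; ", pages));
    }
}
```

Parallel.For returns only after every iteration has finished, so there is no separate "wait for the threads" step.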
Maybe you should use the thread pool.
Example from MSDN:
using System;
using System.Threading;
public class Fibonacci
{
private int _n;
private int _fibOfN;
private ManualResetEvent _doneEvent;
public int N { get { return _n; } }
public int FibOfN { get { return _fibOfN; } }
// Constructor.
public Fibonacci(int n, ManualResetEvent doneEvent)
{
_n = n;
_doneEvent = doneEvent;
}
// Wrapper method for use with thread pool.
public void ThreadPoolCallback(Object threadContext)
{
int threadIndex = (int)threadContext;
Console.WriteLine("thread {0} started...", threadIndex);
_fibOfN = Calculate(_n);
Console.WriteLine("thread {0} result calculated...", threadIndex);
_doneEvent.Set();
}
// Recursive method that calculates the Nth Fibonacci number.
public int Calculate(int n)
{
if (n <= 1)
{
return n;
}
return Calculate(n - 1) + Calculate(n - 2);
}
}
public class ThreadPoolExample
{
static void Main()
{
const int FibonacciCalculations = 10;
// One event is used for each Fibonacci object.
ManualResetEvent[] doneEvents = new ManualResetEvent[FibonacciCalculations];
Fibonacci[] fibArray = new Fibonacci[FibonacciCalculations];
Random r = new Random();
// Configure and start threads using ThreadPool.
Console.WriteLine("launching {0} tasks...", FibonacciCalculations);
for (int i = 0; i < FibonacciCalculations; i++)
{
doneEvents[i] = new ManualResetEvent(false);
Fibonacci f = new Fibonacci(r.Next(20, 40), doneEvents[i]);
fibArray[i] = f;
ThreadPool.QueueUserWorkItem(f.ThreadPoolCallback, i);
}
// Wait for all threads in pool to calculate.
WaitHandle.WaitAll(doneEvents);
Console.WriteLine("All calculations are complete.");
// Display the results.
for (int i= 0; i<FibonacciCalculations; i++)
{
Fibonacci f = fibArray[i];
Console.WriteLine("Fibonacci({0}) = {1}", f.N, f.FibOfN);
}
}
}

how to keep track of running location for a long running parallel program

I have a restartable program that runs over a very large space and I have started parallelizing it some. Each Task runs independently and updates a database with its results. It doesn't matter if tasks are repeated (they are fully deterministic based on the input array and will simply generate the same result they did before), but doing so is relatively inefficient. So far I have come up with the following pattern:
static void Main(string[] args) {
GeneratorStart = Storage.Load();
var tasks = new List<Task>();
foreach (int[] temp in Generator()) {
var arr = temp;
var task = new Task(() => {
//... use arr as needed
});
task.Start();
tasks.Add(task);
if (tasks.Count > 4) {
Task.WaitAll(tasks.ToArray());
Storage.UpdateStart(temp);
tasks = new List<Task>();
}
}
}
Prior to making the generator restartable, I had a simple Parallel.ForEach loop on it, and it was a bit faster. I think I am losing some CPU time in the WaitAll operation. How can I get rid of this bottleneck while still keeping track of which tasks I don't have to run again when I restart?
Other bits for those concerned (shortened for brevity to question):
class Program {
static bool Done = false;
static int[] GeneratorStart = null;
static IEnumerable<int[]> Generator() {
var s = new Stack<int>();
//... omitted code to initialize stack to GeneratorStart for brevity
yield return s.ToArray();
while (!Done) {
Increment(s);
yield return s.Reverse().ToArray();
}
}
static int Base = 25600; //example number (none of this is important
static void Increment(Stack<int> stack) { //outside the fact
if (stack.Count == 0) { //that it is generating an array
stack.Push(1); //of a large base
return; //behaving like an integer
} //with each digit stored in an
int i = stack.Pop(); //array position)
i++;
if (i < Base) {
stack.Push(i);
return;
}
Increment(stack);
stack.Push(0);
}
}
I've come up with this:
var tasks = new Queue<Pair<int[],Task>>();
foreach (var temp in Generator()) {
var arr = temp;
tasks.Enqueue(new Pair<int[], Task>(arr, Task.Run(() => {
//... use arr as needed
})));
var tArray = tasks.Select(v => v.B).Where(t => !t.IsCompleted).ToArray();
if (tArray.Length > 7) {
Task.WaitAny(tArray);
var first = tasks.Peek();
while (first != null && first.B.IsCompleted) {
Storage.UpdateStart(first.A);
tasks.Dequeue();
first = tasks.Count == 0 ? null : tasks.Peek();
}
}
}
...
class Pair<TA,TB> {
public TA A { get; set; }
public TB B { get; set; }
public Pair(TA a, TB b) { A = a; B = b; }
}

How to use multi threading in a For loop

I want to achieve the below requirement; please suggest some solution.
string[] filenames = Directory.GetFiles(@"C:\Temp"); //10 files
for (int i = 0; i < filenames.Length; i++)
{
ProcessFile(filenames[i]); //it takes time to execute
}
I want to implement multi-threading. For example, there are 10 files, and I want to process 3 files at a time (configurable, say maxthreadcount). So 3 files are processed in 3 threads from the for loop, and when any thread completes, it should pick the next item from the loop. I also want to ensure that all the files are processed before the for loop exits.
Please suggest the best approach.
Try
Parallel.For(0, filenames.Length, i => {
ProcessFile(filenames[i]);
});
See the MSDN documentation for Parallel.For.
It is only available from .NET 4 onward. I hope that is acceptable.
This will do the job in .net 2.0:
class Program
{
static int workingCounter = 0;
static int workingLimit = 10;
static int processedCounter = 0;
static void Main(string[] args)
{
string[] files = Directory.GetFiles("C:\\Temp");
int checkCount = files.Length;
foreach (string file in files)
{
//wait for free limit...
while (workingCounter >= workingLimit)
{
Thread.Sleep(100);
}
Interlocked.Increment(ref workingCounter); // atomic, to match the Interlocked.Decrement in ProcessFile
ParameterizedThreadStart pts = new ParameterizedThreadStart(ProcessFile);
Thread th = new Thread(pts);
th.Start(file);
}
//wait for all threads to complete...
while (processedCounter< checkCount)
{
Thread.Sleep(100);
}
Console.WriteLine("Work completed!");
}
static void ProcessFile(object file)
{
try
{
Console.WriteLine(DateTime.Now.ToString() + " received: " + file + " thread count is: " + workingCounter.ToString());
//make some sleep for demo...
Thread.Sleep(2000);
}
catch (Exception ex)
{
//handle your exception...
string exMsg = ex.Message;
}
finally
{
Interlocked.Decrement(ref workingCounter);
Interlocked.Increment(ref processedCounter);
}
}
}
Take a look at the Producer/Consumer Queue example by Joe Albahari. It should provide a good starting point for what you're trying to accomplish.
You could use the ThreadPool.
Example:
ThreadPool.SetMaxThreads(3, 3);
for (int i = 0; i < filenames.Length; i++)
{
ThreadPool.QueueUserWorkItem(new WaitCallback(ProcessFile), filenames[i]);
}
static void ProcessFile(object fileNameObj)
{
var fileName = (string)fileNameObj;
// do your processing here.
}
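If you are using the ThreadPool elsewhere in your application then this would not be a good solution, since the pool is shared across your app. Also note that SetMaxThreads returns false and changes nothing if the requested maximum is smaller than the number of processors on the machine.
You could also grab a different thread pool implementation, for example SmartThreadPool.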
Rather than starting a thread for each file name, put the file names into a queue and then start up three threads to process them. Or, since the main thread is now free, start up two threads and let the main thread work on it, too:
Queue<string> MyQueue;
void MyProc()
{
string[] filenames = Directory.GetFiles(...);
MyQueue = new Queue<string>(filenames);
// start two threads
Thread t1 = new Thread((ThreadStart)ProcessQueue);
Thread t2 = new Thread((ThreadStart)ProcessQueue);
t1.Start();
t2.Start();
// main thread processes the queue, too!
ProcessQueue();
// wait for threads to complete
t1.Join();
t2.Join();
}
private object queueLock = new object();
void ProcessQueue()
{
while (true)
{
string s;
lock (queueLock)
{
if (MyQueue.Count == 0)
{
// queue is empty
return;
}
s = MyQueue.Dequeue();
}
ProcessFile(s);
}
}
Another option is to use a semaphore to control how many threads are working:
Semaphore mySem = new Semaphore(3, 3);
void MyProc()
{
string[] filenames = Directory.GetFiles(...);
foreach (string s in filenames)
{
mySem.WaitOne();
ThreadPool.QueueUserWorkItem(ProcessFile, s);
}
// wait for all threads to finish
int count = 0;
while (count < 3)
{
mySem.WaitOne();
++count;
}
}
void ProcessFile(object state)
{
string fname = (string)state;
// do whatever
mySem.Release(); // release so another thread can start
}
The first will perform somewhat better because you don't have the overhead of starting and stopping a thread for each file name processed. The second is much shorter and cleaner, though, and takes full advantage of the thread pool. Likely you won't notice the performance difference.
You can set the maximum number of threads using ParallelOptions. See:
Parallel.For Method (Int32, Int32, ParallelOptions, Action<Int32>)
ParallelOptions.MaxDegreeOfParallelism
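A short sketch of how that could look for the file-processing case, assuming a hypothetical ProcessAll helper whose processFile delegate stands in for the real ProcessFile; the file names in Main are made up. At most maxThreadCount files are processed at once, each finished worker picks up the next file, and Parallel.ForEach returns only after all files have been handled.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledDemo
{
    // Hypothetical helper: runs `processFile` on every name, at most
    // `maxThreadCount` at a time, and returns how many were processed.
    public static int ProcessAll(string[] filenames, int maxThreadCount, Action<string> processFile)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreadCount };
        int processed = 0;

        // Blocks until every file has been processed.
        Parallel.ForEach(filenames, options, name =>
        {
            processFile(name);
            Interlocked.Increment(ref processed); // atomic shared counter
        });
        return processed;
    }

    public static void Main()
    {
        string[] names = { "a.txt", "b.txt", "c.txt", "d.txt" }; // made-up names
        int done = ProcessAll(names, 3, name => Console.WriteLine("processing " + name));
        Console.WriteLine(done + " files processed");
    }
}
```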
var results = filenames.ToArray().AsParallel().Select(filename=>ProcessFile(filename)).ToArray();
bool ProcessFile(object fileNameObj)
{
var fileName = (string)fileNameObj;
// do your processing here.
return true;
}
