I have developed an application that pulls a set number of records from my database for each of a set number of threads. Each thread then iterates its collection and performs data validation on each record.
Once the application has validated a record, it is updated in the database as being valid or not valid. Each thread only pulls records that have not yet been run through the application; a bit column indicates whether the application has already retrieved the data.
So, potentially, the system can run out of data, depending on the number of threads and records per thread. I want the application to keep checking the database for any records that have not been run, then start the process of creating the threads, and finally validate the data.
Here is an example:
There are 50 records in the database.
We are running 5 threads with 10 records per thread.
The application runs, the threads are created, the records are pulled and then processed. Now, the system is out of data. A user imports more data into the DB. The application, still looking to see if there are any records, sees that there are 5 new records in the database. It then starts the process all over to create the threads and process the records.
How can I have the system continue to look for data but allow the user to stop it if need be? I tried using this:
while (RecordsFound <= 0)
{
    // ...sleep code
}
RunProcessMethod();
But the WinForms UI locks up, obviously, during the waiting period. I tried adding the wait logic to another thread, but was afraid that if I run the process method from that thread, via a delegate, things would get weird, since I am creating additional threads inside that method.
Thoughts?
The easiest way to fix this is to use a notification mechanism instead of polling. That is, once you've spawned off the threads that read data from the database, make them responsible for notifying the UI when they are complete, instead of having the UI wait for them.
The simplest approach is to pass in a delegate for the threads to call when they have finished with the set of records found. The UI can then update as soon as the records are available.
delegate void CallBack(List<Data> found);

void OnDataFound(List<Data> found)
{
    // Get back on the UI thread if we're not already on it
    if (this.InvokeRequired)
    {
        this.Invoke(new CallBack(OnDataFound), new object[] { found });
        return;
    }
    // Update the display with the found records
}
I tried adding the wait logic to another thread, but was afraid that if I run the process method from that thread, via a delegate, things would get weird, since I am creating additional threads inside that method. Thoughts?
You don't need to fear this. It's the proper way to handle this type of scenario. There is no problem with a background thread creating additional threads.
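To make that concrete, here is a minimal sketch of the polling loop moved onto a background thread. CountUnprocessedRecords() and RunProcess() are hypothetical stand-ins for your own methods; the volatile flag gives the user a working stop button while the UI stays responsive.

using System;
using System.Threading;

// Inside the form class:
private volatile bool stopRequested;

private void StartPolling()
{
    Thread poller = new Thread(() =>
    {
        while (!stopRequested)
        {
            if (CountUnprocessedRecords() > 0) // hypothetical query
            {
                RunProcess(); // creates the worker threads as before
            }
            Thread.Sleep(5000); // wait before checking the database again
        }
    });
    poller.IsBackground = true; // don't keep the app alive after the form closes
    poller.Start();
}

private void StopButton_Click(object sender, EventArgs e)
{
    stopRequested = true; // the loop exits after its current iteration
}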
I have a GUI application with many tables. Each table is filled with the help of its own BackgroundWorker instance. But now I need to render a table whose rows come from several slow remote sources, so each row should also be shown in parallel, as soon as it is received.
I see two ways:
1. instead of one BackgroundWorker for the table, create an instance per row;
2. keep using a single BackgroundWorker (for the UI interaction) but, in its DoWorkEventHandler, perform the requests with Parallel.ForEach over the source collection, calling ProgressChanged after each response.
Which is more correct?
Assuming it's like my case: I have sources that are mostly third party, that I have no control over, and that are split across different formats (web service, WCF, local unmanaged DLL, .NET DLL, Java service, and Excel), just to fill a single list.
Anyhow, in my case I used 7 workers.
The first one lists all the sources; nowadays I have 40-ish sources.
That main worker then starts up to 6 other workers, with one source each, and updates progress asynchronously. Once a worker finishes, the main worker starts a new one with the next source on the list, and so on.
I noticed that above 6 workers performance dropped in my case, but that depends on your architecture and the type of source. If I had fewer sources accessed over the web I could increase that number, but bandwidth slows things down.
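A minimal sketch of that "at most six in flight" idea, expressed with the question's second option (one BackgroundWorker plus Parallel.ForEach). ListSources(), LoadSource(), AddRowsToTable(), and the Row type are assumptions, not code from either post.

using System.Collections.Generic;
using System.ComponentModel;
using System.Threading.Tasks;

IEnumerable<string> sources = ListSources();   // hypothetical: the ~40 sources
var worker = new BackgroundWorker { WorkerReportsProgress = true };

worker.DoWork += (s, e) =>
{
    // Cap the fan-out at 6 sources in flight, per the experience above
    var options = new ParallelOptions { MaxDegreeOfParallelism = 6 };
    Parallel.ForEach(sources, options, source =>
    {
        List<Row> rows = LoadSource(source);   // hypothetical slow remote call
        worker.ReportProgress(0, rows);        // safe to call from any thread
    });
};

worker.ProgressChanged += (s, e) =>
{
    AddRowsToTable((List<Row>)e.UserState);    // runs on the UI thread
};

worker.RunWorkerAsync();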
My server application, written in C#, starts a new thread every time it needs to insert or remove data from the database. The problem is that, since the execution order of the threads is arbitrary, it is not ensured that a delete command executes after the insertion of the same object if those events occur at almost the same time.
E.g.: The server receives the command to insert multiple objects. It takes about 5 seconds to insert all the objects. After 1 second of execution, the server receives the command to delete all those objects from the database again. Since the removal could happen before all objects are completely stored, the outcome is unknown.
How can the order of execution of certain threads be managed?
You can use transactions for this and specify different levels for different operations.
For example, you can use the highest transaction level for writes/updates/deletes but a low level for reads. You can also fine-tune this to lock only specific rows rather than whole tables. The specific terminology depends on the database and data access library you use.
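As a rough illustration, here is what a write wrapped in a serializable transaction might look like with System.Transactions; InsertRecords() is a hypothetical data-access call.

using System.Transactions;

var writeOptions = new TransactionOptions
{
    IsolationLevel = IsolationLevel.Serializable // strictest level for writes
};

using (var scope = new TransactionScope(TransactionScopeOption.Required, writeOptions))
{
    InsertRecords();  // hypothetical data-access call
    scope.Complete(); // commit; disposing without Complete() rolls back
}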
I would advise against using any ordering. Parallel and ordered just don't go well together. For example:
You need to horizontally scale servers; once you add a second server and a load balancer, a mutex solution will not work.
In a large, distributed system a message queue won't work either, because by the time one thread has completed a scan and decided that we are good to go, another thread can write a message that should have prevented the operation from executing. Moreover, under high load, scanning the same queue multiple times is inefficient.
If you know that you receive the insert before the delete, and the problem is just that you don't want the insertion interrupted, then you can simply use a lock around your insertion code.
// One shared lock object guards both operations
static object m_Lock = new object();

public void Insert()
{
    lock (m_Lock)
    {
        InsertRecords();
    }
}

public void Remove()
{
    // Blocks until any in-progress Insert() releases the lock
    lock (m_Lock)
    {
        RemoveRecords();
    }
}
This way you are sure that remove won't happen during insert.
P.S. Seems strange though that you need to insert and then delete right away.
I think the simplest way is to queue all incoming requests to insert objects in one collection, and to queue all incoming requests to delete objects in a second collection.
The server should have a basic loop that does:
a. check if there are incoming inserts; if so, perform all inserts.
b. check if there are incoming delete requests; if so, perform all delete requests.
c. sleep for X milliseconds.
Now, if you have a delete request for an object that does not exist, you have two options:
a. ignore the request and discard it;
b. ignore the request for this round but keep it in the collection for the next N rounds before finally discarding it (assuming it is simply a bad request and not a race condition).
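A minimal sketch of that loop, assuming two ConcurrentQueue collections that the request-receiving threads enqueue into; MyObject, the running flag, InsertObject(), and DeleteObject() are all hypothetical.

using System.Collections.Concurrent;
using System.Threading;

var inserts = new ConcurrentQueue<MyObject>(); // filled by receiving threads
var deletes = new ConcurrentQueue<MyObject>();

while (running) // hypothetical flag cleared on shutdown
{
    MyObject item;
    while (inserts.TryDequeue(out item)) // a. drain all pending inserts first
        InsertObject(item);
    while (deletes.TryDequeue(out item)) // b. then all pending deletes
        DeleteObject(item);
    Thread.Sleep(100);                   // c. sleep for X milliseconds
}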
Use a Queue (with a single servicing thread) to enforce the ordering. You can also use Task Parallel Library to manage tasks with dependencies on other tasks, though that's very difficult with arbitrary DB operations.
I think you need to rethink how you manage the incoming operations, and whether or not their inter-dependencies are predictable enough that you can safely use multiple threads in this way. You may need to add some "depends on" information into incoming operations to achieve that goal.
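Where a dependency is known up front, a small sketch of the Task Parallel Library approach mentioned above: chaining the delete onto the insert as a continuation guarantees it cannot start early. InsertRecords() and RemoveRecords() are hypothetical.

using System.Threading.Tasks;

// The continuation only runs after the antecedent task has finished,
// so the delete can never overtake the insert.
Task insert = Task.Run(() => InsertRecords());
Task delete = insert.ContinueWith(t => RemoveRecords());
delete.Wait(); // optional: block until the whole chain completes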
I have a database table that contains some records to be processed. The table has a flag column that represents the following status values: 1 = ready to be processed, 2 = successfully processed, 3 = processing failed.
The .NET code (a repeating process, console/service) will grab a list of records that are ready to be processed, loop through them, attempt to process each one (not very lengthy), and update the status based on success or failure.
To get better performance, I want to enable multithreading for this process. I'm thinking of spawning, say, 6 threads, each thread grabbing a subset.
Obviously I want to avoid having different threads process the same records. I don't want a "being processed" flag in the database, because of the case where a thread crashes and leaves the record hanging.
The only way I see of doing this is to grab the complete list of available records and assign a group (maybe of ids) to each thread. If an individual thread fails, its unprocessed records will be picked up the next time the process runs.
Are there any alternatives to dividing the records into groups before assigning them to threads?
The most straightforward way to implement this requirement is to use the Task Parallel Library's Parallel.ForEach (or Parallel.For), and allow it to manage the individual worker threads.
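A minimal sketch, where FetchReadyRecords(), ProcessRecord(), and MarkRecord() are hypothetical stand-ins for the question's table access:

using System.Threading.Tasks;

var records = FetchReadyRecords();   // rows with status 1 (ready)

Parallel.ForEach(records, record =>  // TPL partitions the list across threads
{
    bool ok = ProcessRecord(record);
    MarkRecord(record.Id, ok ? 2     // 2 = successfully processed
                             : 3);   // 3 = processing failed
});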
From experience, I would recommend the following:
Have an additional status "Processing"
Have a column in the database that indicates when a record was picked up for processing, and a cleanup task/process that runs periodically looking for records that have been "Processing" for far too long (and resets their status to "ready for processing").
Even though you don't want it, "being processed" will be essential to crash recovery scenarios (unless you can tolerate the same record being processed twice).
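A sketch of such a cleanup pass, assuming SQL Server, a PickedUpAt column, and 4 as the extra "Processing" status value (the question only defined 1 through 3):

using System.Data.SqlClient;

const string cleanupSql = @"
    UPDATE Records
    SET    Status = 1, PickedUpAt = NULL          -- back to 'ready'
    WHERE  Status = 4                             -- 4 = processing (assumed)
    AND    PickedUpAt < DATEADD(MINUTE, -30, GETUTCDATE());";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(cleanupSql, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery(); // any record stuck 'Processing' over 30 min is reset
}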
Alternatively
Consider using a transactional queue (MSMQ or Rabbit MQ come to mind). They are optimized for this very problem.
That would be my clear choice, having done both at massive scale.
Optimizing
If it takes a non-trivial amount of time to retrieve data from the database, you can consider a Producer/Consumer pattern, which is quite straightforward to implement with a BlockingCollection. That pattern allows one thread (producer) to populate a queue with DB records to be processed, and multiple other threads (consumers) to process items off of that queue.
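A sketch of that pattern, assuming hypothetical FetchReadyRecords() and ProcessRecord() calls and a Record type:

using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var queue = new BlockingCollection<Record>(100); // bounded buffer

var producer = Task.Run(() =>
{
    foreach (var record in FetchReadyRecords())
        queue.Add(record);      // blocks if consumers fall 100 items behind
    queue.CompleteAdding();     // tells consumers there is no more work
});

var consumers = Enumerable.Range(0, 6).Select(_ => Task.Run(() =>
{
    foreach (var record in queue.GetConsumingEnumerable())
        ProcessRecord(record);  // loop ends cleanly once the queue drains
})).ToArray();

Task.WaitAll(consumers);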
A New Alternative
Given that several processing steps touch the record before it is considered complete, have a look at Windows Workflow Foundation as a possible alternative.
I remember doing something like what you described... A thread checks from time to time whether there is something new in the database that needs to be processed. It loads only the new ids, so if at time x the last id read is 1000, at x+1 it will read from id 1001.
Everything it reads goes into a thread-safe queue. When items are added to this queue, you notify the worker threads (maybe using AutoResetEvent, or spawning threads here). Each thread reads from this thread-safe queue one item at a time, until the queue is empty.
You should not assign the work to each thread up front (unless you know that each item takes the same amount of time to process). If a thread finishes its work, it should take over the load left by the other ones; by using this thread-safe queue, you make sure of that.
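A minimal sketch of that scheme; LoadNewItems(), ProcessItem(), and the Item type are hypothetical.

using System;
using System.Collections.Concurrent;
using System.Threading;

ConcurrentQueue<Item> workItems = new ConcurrentQueue<Item>();
AutoResetEvent workAvailable = new AutoResetEvent(false);

// Poller: loads only ids greater than the last one it has seen
void Poll()
{
    long lastId = 0;
    while (true)
    {
        foreach (Item item in LoadNewItems(lastId)) // hypothetical query
        {
            workItems.Enqueue(item);
            lastId = Math.Max(lastId, item.Id);
        }
        workAvailable.Set();   // wake a waiting worker
        Thread.Sleep(5000);
    }
}

// Worker: faster threads naturally drain more items from the queue
void Work()
{
    while (true)
    {
        workAvailable.WaitOne();
        Item item;
        while (workItems.TryDequeue(out item))
            ProcessItem(item);
    }
}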
Here is one approach that does not rely on an additional database column (but see #4) or mandate an in-process queue. The premise of this approach is to "shard" records across workers based on some consistent value, much like a distributed cache.
Here are my assumptions:
1. Re-processing does not cause unwanted side-effects; at most some work "is wasted".
2. The number of threads is fixed upon start-up. This is not a requirement, but it does simplify the implementation and allows me to skip transitory details in the simple description below.
3. There is only one "worker process" (but see #1) controlling the "worker threads". This simplifies dealing with how the records are split between workers.
4. There is some [immutable] "ID" column which is "well distributed". This is required so each worker gets about the same amount of work.
5. Work can be done "out of order" as long as it is "eventually done". Also, workers might not always run "at 100%" due to each one effectively working on a different queue.
Assign each thread a unique bucket value from [0, thread_count). If a thread dies/is restarted it will take the same bucket as that which it vacated.
Then, each time a thread needs a new record, it fetches from the database:
SELECT *
FROM record
WHERE state = 'unprocessed'
AND (id % $thread_count) = $bucket
ORDER BY date
There could of course be other refinements, such as reading "this thread's tasks" in batches and storing them locally. A local queue, however, would be per thread (and thus reloaded upon a new thread's startup), and it would only hold records associated with the given bucket.
When a thread finishes processing a record, it should mark the record as processed, using the appropriate isolation level and/or optimistic concurrency, and proceed to the next record.
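For the optimistic-concurrency variant, the "mark processed" update can simply re-check the state in its WHERE clause; if another worker got there first, no rows are affected. A sketch, assuming an open SqlConnection:

using System.Data.SqlClient;

const string markSql = @"
    UPDATE record
    SET    state = 'processed'
    WHERE  id = @id
    AND    state = 'unprocessed';"; // re-check: no-op if already claimed

using (var cmd = new SqlCommand(markSql, connection))
{
    cmd.Parameters.AddWithValue("@id", record.Id);
    bool won = cmd.ExecuteNonQuery() == 1; // 0 rows => another worker got it
}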
I have the following scenario:
I have a database storing Jobs which are fetched and processed by a server. The database is accessed via Entity Framework.
The server processes the jobs in parallel using multiple threads. For this I have one thread which periodically checks for new Jobs in the database and distributes them to worker threads.
My problem now is that my Entities have a Progress property which should be updated by the worker threads and periodically written to the database.
The worker threads update the property quite often (many times per second), but for my requirements it's enough if the database is updated every few seconds, and I don't want to make too many unnecessary updates to the database.
So far my solution is to have the worker threads write the progress to the Entity directly and have the thread which checks for updates also issue these changes to the database.
My question here is: Is this thread safe from the EF point of view? Can I update properties of an Entity on one thread and write the changes to the database on another thread? Do I need any kind of locking? Keep in mind that I use the DataContext only in one thread (at least explicitly, since I don't know what EF does internally when I update a (non-POCO) entity).
Another requirement now is that I need to load additional data from the database in the worker threads. I assume I have to use separate DataContexts for this, and I don't really like having to manage entities from two separate data contexts in the same thread.
Do you have any advice how to structure this in a nice way?
Since every worker is only updating the status of one Job-Entity, one idea was to expose the progress as a property of the worker-thread's class, to be fetched by the main thread; the main thread would then update the entities and issue the update to the database. But I still need the original Job-Entity in the worker thread to read configuration data, and if I reattach it to the DataContext of the worker thread, I cannot use the entity anymore on the main thread. I want to avoid loading the same entity twice if it's not really necessary...
Is it possible to duplicate an Entity automatically, to use it in 2 separate DataContexts?
Thanks for any ideas!
At the end I made the following decision:
My main class / main thread reads the jobs from the database and distributes them to various worker threads. For each job there is a corresponding Job-Executor whose .Execute() method is run by the worker thread.
By convention, the executor classes read all necessary configuration data from the Job-Entity when they are constructed and are not allowed to touch it anymore during the execution period. Since construction of the executor class is done on the main thread, there is no multithreaded access here.
Changing state, like the progress of a Job is exposed via properties on the executor class and periodically synched to the entities/database from the main thread.
The worker threads also have their own DataContext to load additional data when necessary.
All other multi-thread access to DataContext is synchronized with locks.
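A minimal sketch of that convention; the Job entity's Configuration property and the work loop are hypothetical.

class JobExecutor
{
    private readonly string config; // copied from the entity at construction
    private volatile int progress;  // safe for the main thread to read

    public JobExecutor(Job job)
    {
        config = job.Configuration; // runs on the main thread only
    }

    public int Progress { get { return progress; } }

    public void Execute() // run by the worker thread
    {
        for (int i = 0; i <= 100; i++)
        {
            // ... do a slice of work using `config` ...
            progress = i; // never touches the entity or its DataContext
        }
    }
}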
I think that you should redesign your system a bit.
You are running into trouble because the progress of an entity is stored within the entity.
If you separate it out so that you have one table/context that contains the progress of all jobs, each thread can update this, and it can be saved to the database periodically using a timer.
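A sketch of that separation; the JobsContext, the JobProgress table, and the ReportProgress() wiring are assumptions.

using System.Collections.Concurrent;
using System.Threading;

// Workers write cheap, lock-free updates many times per second
ConcurrentDictionary<int, int> progress = new ConcurrentDictionary<int, int>();

void ReportProgress(int jobId, int percent)
{
    progress[jobId] = percent;
}

// Timer callback: one thread, its own context, every few seconds
void Flush(object state)
{
    using (var db = new JobsContext())
    {
        foreach (var pair in progress)
        {
            var row = db.JobProgress.Find(pair.Key); // assumes a row per job
            if (row != null) row.Percent = pair.Value;
        }
        db.SaveChanges();
    }
}

Timer flushTimer = new Timer(Flush, null, 5000, 5000);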
I have a program that does the following:
1. Call the webservice (there are many calls to the same web service)
2. Process the result of 1.
3. Insert the result of 2. into a DB
So I think it should be better to do some multithreading. I think I can do it like this:
one thread is the master (let's call it A)
it creates some thread which calls the webservices (let's call it W)
when W has some results it sends it to A (or A detects that W has some stuff)
A sends the results to some computing thread (let's call it C)
when C has some results it sends it to A (or A detects that C has some stuff)
A sends the results to some database thread (let's call it D)
So sometimes C or D will wait for work to do.
With this technique I'll be able to set the number of threads for each task.
Can you please tell me how I can do that, and whether there is a pattern for it?
EDIT: I wrote "some" instead of "a" because I'll create many threads for the time-consuming steps, and maybe only one for the fastest.
It sounds to me like you could use the producer/consumer pattern. With .NET 4 this has become really simple to implement. Start a number of Tasks and use the BlockingCollection<T> as a buffer between the tasks. Check out this post for details.
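For this particular W -> C -> D layout, a sketch with two BlockingCollection buffers between the stages; PendingCalls(), CallWebService(), Compute(), InsertIntoDb(), and the Result/Row types are hypothetical.

using System.Collections.Concurrent;
using System.Threading.Tasks;

var raw      = new BlockingCollection<Result>(100); // W -> C buffer
var computed = new BlockingCollection<Row>(100);    // C -> D buffer

var w = Task.Run(() =>
{
    foreach (var call in PendingCalls())
        raw.Add(CallWebService(call));
    raw.CompleteAdding();
});

var c = Task.Run(() =>
{
    foreach (var result in raw.GetConsumingEnumerable())
        computed.Add(Compute(result));
    computed.CompleteAdding();
});

var d = Task.Run(() =>
{
    foreach (var row in computed.GetConsumingEnumerable())
        InsertIntoDb(row);
});

Task.WaitAll(w, c, d);

To give a slow stage more threads, start several tasks over the same GetConsumingEnumerable() and call CompleteAdding() only after all of them have finished.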
In .NET you have a thread pool.
When you release a pool thread it does not actually get destroyed; it just goes back into the thread pool, and when you request a new thread you get one from the pool.
If they are not used for a long time, the thread pool closes them.
I would start two timers, which will fire their event handlers on separate ThreadPool threads. The event handler for the first timer will check the web service for data, write it to a Queue<T> if it finds some, and then go back to sleep until the next interval.
The event handler for the second timer reads the queue and updates the database, then sleeps until its next interval. Both event handlers should wrap access to the queue in a lock to manage concurrency.
Separate timers with independent intervals will let you decouple when data is available from how long it takes to insert it into the database, with the queue acting as a buffer. Since generic queues can grow dynamically, you get some breathing room even if the database is unavailable for a time. The second event handler could even spawn multiple threads to insert multiple records simultaneously or to mirrored databases. The event handlers can also post updates to a log file or user interface to help you monitor activity.
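A minimal sketch of the two-timer scheme; PollWebService(), InsertIntoDb(), and the Record type are hypothetical, and the intervals are placeholders.

using System.Collections.Generic;
using System.Threading;

Queue<Record> queue = new Queue<Record>();
object sync = new object();

// First timer: check the web service and buffer whatever it finds
Timer pollTimer = new Timer(_ =>
{
    var records = PollWebService();
    lock (sync)
    {
        foreach (var r in records) queue.Enqueue(r);
    }
}, null, 0, 10000); // every 10 seconds

// Second timer: drain the buffer into the database on its own schedule
Timer dbTimer = new Timer(_ =>
{
    List<Record> batch;
    lock (sync)
    {
        batch = new List<Record>(queue); // snapshot under the lock
        queue.Clear();
    }
    foreach (var r in batch) InsertIntoDb(r);
}, null, 0, 2000);  // every 2 seconds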