Managing a long-running data-processing task using threaded queues - C#

I have a database-synchronisation task that takes some time to process, as there are in the region of 120k leaf records, but they are remote and relatively slow to access.
Currently, my app does a fairly naive process:
1. Get a list of all the local Contacts
2. For each local contact, get all the related data
3. Get the matching remote contact
4. Compare the two and do stuff to bring them in sync
Step 1 returns data before it's finished, and step 4 doesn't involve comparisons between different contacts in the same set.
What I was hoping to do was use some sort of queue construct and start populating it in step 1, then immediately move onto step 2 and start processing items as they come in, using multiple threads.
The process then becomes:
1. Start populating the queue with contacts
2. While there are items in the queue, start a thread and:
   - Take the front contact from the queue
   - Fetch the remote contact
   - Compare them
   - Perform the required updates
Am I correct in the assumption that I can create a new ConcurrentQueue, start populating it, then loop over it as I might a single-threaded simple collection?
(I've not put in any error-checking or the actual threading, to keep the example simple)
using System.Collections.Concurrent;

class Contact { } // will be the real entity in practice

class Program
{
    static void Main(string[] args)
    {
        Processor p = new Processor();
        p.Process();
    }
}

class Processor
{
    volatile bool FetchComplete = false; // volatile, since two threads will share it
    ConcurrentQueue<Contact> q = new ConcurrentQueue<Contact>();

    public void Process()
    {
        this.PopulateQueue(); // this will be fired off using QueueUserWorkItem for example

        // Keep draining until the fetch is done AND the queue is empty,
        // otherwise items enqueued near the end would be dropped.
        while (!FetchComplete || !q.IsEmpty)
        {
            Contact contact;
            if (q.TryDequeue(out contact)) // check the result; "Count > 0 then Dequeue" is racy
            {
                ProcessContact(contact); // this will also be in QueueUserWorkItem
            }
        }
    }

    // a long running process that fills the queue with Contacts
    private void PopulateQueue()
    {
        this.FetchComplete = false;
        // foreach contact in database
        Contact contact = new Contact(); // contact will come from DB
        this.q.Enqueue(contact);
        // end foreach
        this.FetchComplete = true;
    }

    private void ProcessContact(Contact contact)
    {
        // do magic with contact
    }
}

You might be better off using BlockingCollection instead of ConcurrentQueue. The reason is that the former will block the thread calling Take until an item appears in the queue. This is useful when the thread processing the Contact instances clears out the queue before the fetching thread has retrieved them all.
In general your strategy is pretty solid. I use it all the time. It is often referred to as the producer-consumer pattern. When there are more than 2 stages involved in the processing then it is called the pipeline pattern. In that case you would have 2 or more queues instead of the typical one. You can imagine scenarios where each stage forwards the work item onto the next stage via another queue.
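For illustration, here is a minimal sketch of that producer-consumer shape using BlockingCollection (the Contact stub, FetchContacts, and the consumer count of four are placeholders, not part of the asker's code):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class Contact { public int Id; } // stand-in for the real entity

class ContactSync
{
    // A bounded capacity throttles the producer if the consumers fall behind.
    private readonly BlockingCollection<Contact> _queue =
        new BlockingCollection<Contact>(boundedCapacity: 1000);

    public void Run()
    {
        // Producer: fills the queue, then signals that no more items are coming.
        var producer = Task.Run(() =>
        {
            foreach (var contact in FetchContacts()) // placeholder for the DB read
                _queue.Add(contact);
            _queue.CompleteAdding(); // lets the consumers drain the queue and exit
        });

        // Consumers: GetConsumingEnumerable blocks until an item arrives
        // and completes once CompleteAdding has been called.
        var consumers = new Task[4];
        for (int i = 0; i < consumers.Length; i++)
        {
            consumers[i] = Task.Run(() =>
            {
                foreach (var contact in _queue.GetConsumingEnumerable())
                    ProcessContact(contact); // fetch remote, compare, update
            });
        }

        producer.Wait();
        Task.WaitAll(consumers);
    }

    private IEnumerable<Contact> FetchContacts() { yield return new Contact(); } // stand-in
    private void ProcessContact(Contact contact) { /* do magic with contact */ }
}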

Related

IgniteQueue in Apache Ignite.NET

We are using Ignite.NET and don't have the option to use the Ignite Java API (team skills, technology affinity, etc.). We are looking to create a queuing mechanism so that we can process messages in a distributed fashion. I found the IgniteQueue data structure to be the most suitable, but it doesn't seem to be available in Ignite.NET. Could someone please suggest a solution for this scenario? Multiple producers queue unique work items, each of which must be processed reliably by only 1 consumer at a time.
E.g. there are producers P1 and P2 (on different machines) that put T1, T2, T3 on the queue, and there are consumers C1, C2, C3 (on different machines). T1 should be processed by ONLY one of C1, C2, C3, and likewise T2 and T3 should each be processed exactly once by a single consumer.
IgniteQueue is built on top of Ignite Cache, so yes, you can replicate the same functionality in .NET:
1. Create a cache
2. Use a Continuous Query as the consumer, calling ICache.Remove to ensure that every item is processed only once
3. Add data to the cache on the producers with Data Streamers, or just use ICache.Put / PutAll
Below is the code for continuous query listener:
class CacheEventListener<TK, TV> : ICacheEntryEventListener<TK, TV>
{
    private readonly string _cacheName;

    [InstanceResource] // Injected automatically.
    private readonly IIgnite _ignite = null;

    private ICache<TK, TV> _cache;

    public CacheEventListener(string cacheName)
    {
        _cacheName = cacheName;
    }

    public void OnEvent(IEnumerable<ICacheEntryEvent<TK, TV>> events)
    {
        _cache = _cache ?? _ignite.GetCache<TK, TV>(_cacheName);

        foreach (var entryEvent in events)
        {
            // Remove returns true on exactly one node, which guarantees
            // that each entry is consumed at most once cluster-wide.
            if (entryEvent.EventType == CacheEntryEventType.Created && _cache.Remove(entryEvent.Key))
            {
                // Run consumer logic here - use another thread for heavy processing.
                Consume(entryEvent.Value);
            }
        }
    }

    private void Consume(TV value)
    {
        // Actual consumer logic goes here.
    }
}
Then we deploy this to every node with a single call:
var consumer = new CacheEventListener<Guid, string>(cache.Name);
var continuousQuery = new ContinuousQuery<Guid, string>(consumer);
cache.QueryContinuous(continuousQuery);
As a result, OnEvent is called once per entry on the primary node for that entry. So there is one consumer per Ignite node. We can increase the effective number of consumers per node by offloading the actual consumer logic to other threads, using BlockingCollection and so on.
And one last thing - we have to come up with a unique cache key for every new entry. The simplest thing is Guid.NewGuid(), but we can also use AtomicSequence.
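For completeness, a rough sketch of the producer side using the standard Ignite.NET cache APIs (the cache name "workItems" and the string payload are made up for the example; assumes a started Ignite node):

// Producer side: every entry gets a fresh Guid key, so no two work
// items collide, and the continuous-query listener above picks each
// one up on that entry's primary node.
IIgnite ignite = Ignition.Start();
var cache = ignite.GetOrCreateCache<Guid, string>("workItems");

// Simple one-at-a-time puts:
cache.Put(Guid.NewGuid(), "work item payload");

// Or, for high-volume producers, a data streamer batches the puts:
using (var streamer = ignite.GetDataStreamer<Guid, string>(cache.Name))
{
    streamer.AddData(Guid.NewGuid(), "work item payload");
}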

Multiple users writing at the same file

I have a Web API project that is accessed by multiple users (I mean a really, really large number of users). When the project is accessed from the frontend (a web page using HTML 5) and a user updates or retrieves data, the backend app (the Web API) writes to a single log file (a .log file whose content is JSON).
The problem is that when it is accessed by multiple users, the frontend becomes unresponsive (always loading). The problem is in the writing of the log file (a single log file being accessed by a very large number of users). I heard that a multi-threading technique can solve the problem, but I don't know which method. So maybe someone can help me, please.
Here is my code (sorry for any typos, I'm using my smartphone and the mobile version of Stack Overflow):
public static void JsonInputLogging<T>(T m, string methodName)
{
    MemoryStream ms = new MemoryStream();
    DataContractJsonSerializer ser = new DataContractJsonSerializer(typeof(T));
    ser.WriteObject(ms, m);
    string jsonString = Encoding.UTF8.GetString(ms.ToArray());
    ms.Close();
    logging("MethodName: " + methodName + Environment.NewLine + jsonString);
}

public static void logging(string message)
{
    string pathLogFile = @"D:\jsoninput.log"; // verbatim string, so the backslash isn't an escape
    FileInfo jsonInputFile = new FileInfo(pathLogFile);
    if (jsonInputFile.Exists)
    {
        long fileLength = jsonInputFile.Length;
        if (fileLength > 1000000)
        {
            // roll the log over once it passes ~1 MB
            File.Move(pathLogFile, *some new path*);
        }
    }
    File.AppendAllText(pathLogFile, *some text*);
}
You have to understand some internals here first. For each [x] users, ASP.Net will use a single worker process. One worker process holds multiple threads. If you're using multiple instances on the cloud, it's even worse, because then you also have multiple server instances (I assume that isn't the case here).
A few problems here:
- You have multiple users and therefore multiple threads.
- Multiple threads can deadlock each other writing the files.
- You have multiple appdomains and therefore multiple processes.
- Multiple processes can lock each other out.
Opening and locking files
File.Open has a few flags for locking. You can basically lock files exclusively per process, which is a good idea in this case. A two-step approach with Exists and Open won't help, because in between, another worker process might do something. Basically, the idea is to call Open with write-exclusive access, and if it fails, try again with another filename.
This basically solves the issue with multiple processes.
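A rough sketch of that idea (the retry count and the fallback naming scheme are illustrative choices, not a fixed recipe):

using System;
using System.IO;

static StreamWriter OpenLogExclusive(string basePath)
{
    // Try the preferred file first, then numbered fallbacks.
    for (int i = 0; i < 5; i++)
    {
        string path = i == 0 ? basePath : basePath + "." + i;
        try
        {
            // FileShare.None takes an exclusive lock: another process
            // trying to open the same file gets an IOException instead
            // of interleaved, corrupted writes.
            var stream = new FileStream(path, FileMode.Append,
                                        FileAccess.Write, FileShare.None);
            return new StreamWriter(stream);
        }
        catch (IOException)
        {
            // Locked by another worker process - fall through to the next name.
        }
    }
    throw new IOException("No available log file.");
}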
Writing from multiple threads
Writes to a single file are inherently serialized. Instead of writing your stuff to the file from every thread, use a single dedicated thread for the file access and have the other threads hand it the messages to write.
If you have more log requests than you can handle, you're in the wrong zone either way. In that case, the best way to handle it for logging IMO is to simply drop the data. In other words, make the logger somewhat lossy to make life better for your users. You can use the queue for that as well.
I usually use a ConcurrentQueue for this and a separate thread that works away all the logged data.
This is basically how to do this:
// Starts the worker thread that gets rid of the queue:
internal void Start()
{
    loggingWorker = new Thread(LogHandler)
    {
        Name = "Logging worker thread",
        IsBackground = true,
        Priority = ThreadPriority.BelowNormal
    };
    loggingWorker.Start();
}
We also need something to do the actual work and some variables that are shared:
private Thread loggingWorker = null;
private int loggingWorkerState = 0;
private ManualResetEventSlim waiter = new ManualResetEventSlim();
private ConcurrentQueue<Tuple<LogMessageHandler, string>> queue =
    new ConcurrentQueue<Tuple<LogMessageHandler, string>>();

private void LogHandler(object o)
{
    Interlocked.Exchange(ref loggingWorkerState, 1);
    while (Interlocked.CompareExchange(ref loggingWorkerState, 1, 1) == 1)
    {
        // Wake when signalled, or every 10 seconds as a safety net.
        waiter.Wait(TimeSpan.FromSeconds(10.0));
        waiter.Reset();

        Tuple<LogMessageHandler, string> item;
        while (queue.TryDequeue(out item))
        {
            writeToFile(item.Item1, item.Item2);
        }
    }
}
Basically this code enables you to work away all the items from a single thread using a queue that's shared across threads. Note that ConcurrentQueue doesn't use locks for TryDequeue, so clients won't feel any pain because of this.
Last thing that's needed is to add stuff to the queue. That's the easy part:
public void Add(LogMessageHandler l, string msg)
{
    if (queue.Count < MaxLogQueueSize)
    {
        queue.Enqueue(new Tuple<LogMessageHandler, string>(l, msg));
        waiter.Set();
    }
}
This code will be called from multiple threads. It's not 100% correct, because Count and Enqueue are not checked and called atomically, so the queue can briefly grow past MaxLogQueueSize - but for our intents and purposes it's good enough. It also doesn't lock during the Enqueue, and the waiter ensures that the items are removed by the other thread.
Wrap all this in a singleton pattern, add some more logic to it, and your problem should be solved.
That can be problematic, since every client request is handled by a new thread by default anyway. You need some "root" object that is known across the project (I don't think you can achieve this with a static class), so you can lock on it before you access the log file. However, note that this will essentially serialize the requests, which will probably have a very bad effect on performance.
No, multi-threading does not solve your problem. How are multiple threads supposed to write to the same file at the same time? You would need to take care of data consistency, and I don't think that's the actual problem here.
What you are looking for is asynchronous programming. The reason your GUI becomes unresponsive is that it waits for the tasks to complete. If you know the logger is your bottleneck, then use async to your advantage: fire the log method, forget about the outcome, and just write the file.
Actually, I don't really think your logger is the problem. Are you sure there is no other logic that blocks you?
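If you do go the fire-and-forget route, the smallest possible sketch looks something like this (note that the writes themselves still need to be serialized, e.g. with the queue approach above, because concurrent appends to one file will still collide):

public static void LogAsync(string message)
{
    // Queue the write on the thread pool and return immediately,
    // so the request thread never waits on the disk.
    Task.Run(() => logging(message))
        .ContinueWith(t => { var ignored = t.Exception; }, // observe the fault, don't crash
                      TaskContinuationOptions.OnlyOnFaulted);
}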

Better Technique: Reading Data in a Thread

I've got a routine called GetEmployeeList that loads when my Windows Application starts.
This routine pulls in basic employee information from our Active Directory server and retains this in a list called m_adEmpList.
We have a few Windows accounts set up as Public Profiles that most of our employees on our manufacturing floor use. This m_adEmpList gives our employees the ability to log in to select features using those Public Profiles.
Once all of the Active Directory data is loaded, I attempt to "auto logon" that employee based on the System.Environment.UserName if that person is logged in under their private profile. (employees love this, by the way)
If I do not thread GetEmployeeList, the Windows Form will appear unresponsive until the routine is complete.
The problem with GetEmployeeList is that we have had times when the Active Directory server was down, the network was down, or a particular computer was not able to connect over our network.
To get around these issues, I have included a ManualResetEvent m_mre with the THREADSEARCH_TIMELIMIT timeout so that the process does not go on forever. I cannot log someone in using their Private Profile with System.Environment.UserName until I have the list of employees.
I realize I am not showing ALL of the code, but hopefully it is not necessary.
public static ADUserList GetEmployeeList()
{
    if ((m_adEmpList == null) ||
        (((m_adEmpList.Count < 10) || !m_gotData) &&
         ((m_thread == null) || !m_thread.IsAlive)))
    {
        m_adEmpList = new ADUserList();
        m_thread = new Thread(new ThreadStart(fillThread));
        m_mre = new ManualResetEvent(false);
        m_thread.IsBackground = true;
        m_thread.Name = FILLTHREADNAME;
        try {
            m_thread.Start();
            // Block until fillThread signals m_mre, or give up after the time limit.
            m_gotData = m_mre.WaitOne(THREADSEARCH_TIMELIMIT * 1000);
        } catch (Exception err) {
            Global.LogError(_CODEFILE + "GetEmployeeList", err);
        } finally {
            if ((m_thread != null) && (m_thread.IsAlive)) {
                // m_thread.Abort();
                m_thread = null;
            }
        }
    }
    return m_adEmpList;
}
I would like to just put a basic lock using something like m_adEmpList, but I'm not sure if it is a good idea to lock something that I need to populate, and the actual data population is going to happen in another thread using the routine fillThread.
If the ManualResetEvent's WaitOne timer fails to collect the data I need in the time allotted, there is probably a network issue, and m_adEmpList will not have many records (if any). So I would need to try to pull this information again the next time.
If anyone understands what I'm trying to explain, I'd like to see a better way of doing this.
It just seems too forced, right now. I keep thinking there is a better way to do it.
I think you're going about the multithreading part the wrong way. Threads should cooperate, not compete for resources, and competition is exactly what's bothering you here. Another problem is that your timeout is too long (so that it annoys users) and at the same time too short (if the AD server is a bit slow, but still there and serving). Your goal should be to let the thread run in the background and update the list when it is finished. In the meantime, present some fallbacks to the user along with a notification that the user list is still being populated.
A few more notes on your code above:
You have a variable m_thread that is only used locally. Further, your code contains a redundant check whether that variable is null.
If you create a user list with defaults/fallbacks first and then update it through a function (make sure you check the InvokeRequired flag of the displaying control!), you won't need a lock. The point is that the thread does not work on the list stored in the member variable but on a separate list it has exclusive access to. The update function then replaces (!) the member list with the new one, so the new list is from then on for exclusive use by the UI.
Lastly, if the AD server is really not there, try to forward the error from the background thread to the UI in some way, so that the user knows what's broken.
If you want, you can add an event to signal the thread to stop, but in most cases that won't even be necessary.
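To make that concrete, here is a hedged sketch of the load-then-swap shape in WinForms terms, written as a method on the form (QueryActiveDirectory and ShowAdError are hypothetical names standing in for the asker's fillThread logic and an error display):

private void LoadEmployeesInBackground()
{
    var worker = new Thread(() =>
    {
        ADUserList freshList;
        try
        {
            freshList = QueryActiveDirectory(); // the worker's own private list
        }
        catch (Exception ex)
        {
            // Forward the failure to the UI instead of swallowing it.
            BeginInvoke(new Action(() => ShowAdError(ex)));
            return;
        }
        // A single swap on the UI thread: no lock is needed, because the
        // worker never touches the list again after handing it over.
        BeginInvoke(new Action(() => m_adEmpList = freshList));
    });
    worker.IsBackground = true;
    worker.Start();
}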

Good advice for using EF in a multithreaded program?

Have you got any good advice for using EF in a multithreaded program?
I have 2 layers:
- an EF layer to read/write to my database
- a multithreaded service which uses my entities (read/write) and performs some computations (I use the Task Parallel Library in the framework)
How can I synchronize my object contexts in each thread ?
Do you know a good pattern to make it work ?
Good advice is - just don't :-) EF barely manages to survive one thread - the nature of the beast.
If you absolutely have to use it, make the lightest DTOs possible, close the ObjectContext (OC) as soon as you have the data, repack the data, spawn your threads just to do calculations and nothing else, wait till they are done, then create another OC and dump the data back into the DB, reconcile it, etc.
If another "main" thread (the one that spawns N calculation threads via the TPL) needs to know when some other thread is done, don't fire an event; just set a flag in the other thread and let its code check the flag in its own loop and react by creating a new OC and then reconciling the data if it has to.
If your situation is simpler, you can adapt this - the key is that you only set a flag and let the other thread react when it's ready. That means it's in a stable state, has finished a round of whatever it was doing, and can act without risking race conditions. Reset the flag (an int) with interlocked operations, and keep some timing data to make sure your threads don't react again within some time T - otherwise they can spend their lifetime just querying the DB.
This is how I implemented it in my scenario:
var processing = new ConcurrentQueue<int>();

// possible multi-threaded enumeration; only non-queued records get processed
Parallel.ForEach(dataEnumeration, dataItem =>
{
    if (!processing.Contains(dataItem.Id))
    {
        processing.Enqueue(dataItem.Id);

        var myEntityResource = new EntityResource();
        myEntityResource.EntityRecords.Add(new EntityRecord
        {
            Field1 = "Value1",
            Field2 = "Value2"
        });

        Exception saveError;
        SaveContext(myEntityResource, out saveError); // matches the signature below

        var itemIdProcessed = 0;
        processing.TryDequeue(out itemIdProcessed);
    }
});
public void RefreshContext(DbContext context)
{
    var modifiedEntries = context.ChangeTracker.Entries()
        .Where(e => e.State == EntityState.Modified || e.State == EntityState.Deleted);

    foreach (var modifiedEntry in modifiedEntries)
    {
        modifiedEntry.Reload();
    }
}
public bool SaveContext(DbContext context, out Exception error, bool reloadContextFirst = true)
{
    error = null;
    var saved = false;
    try
    {
        if (reloadContextFirst)
            this.RefreshContext(context);
        context.SaveChanges();
        saved = true;
    }
    catch (OptimisticConcurrencyException)
    {
        // retry saving on concurrency error
        if (reloadContextFirst)
            this.RefreshContext(context);
        context.SaveChanges();
        saved = true;
    }
    catch (DbEntityValidationException dbValEx)
    {
        var outputLines = new StringBuilder();
        foreach (var eve in dbValEx.EntityValidationErrors)
        {
            outputLines.AppendFormat("{0}: Entity of type \"{1}\" in state \"{2}\" has the following validation errors:",
                DateTime.Now, eve.Entry.Entity.GetType().Name, eve.Entry.State);
            foreach (var ve in eve.ValidationErrors)
            {
                outputLines.AppendFormat("- Property: \"{0}\", Error: \"{1}\"", ve.PropertyName, ve.ErrorMessage);
            }
        }
        throw new DbEntityValidationException(string.Format("Validation errors\r\n{0}", outputLines.ToString()), dbValEx);
    }
    catch (Exception ex)
    {
        error = new Exception("Error saving changes to the database.", ex);
    }
    return saved;
}
I think Craig might be right about your application not needing threads, but you might look into using ConcurrencyCheck in your models to make sure you don't overwrite your changes.
I don't know how much of your application is actually number crunching. If speed is the motivation for using multi-threading, then it might pay off to take a step back and gather data about where the bottleneck is.
In a lot of cases I have found that the limiting factor in applications using a database server is the speed of the I/O system for your storage. For example the speed of the hard drive disk(s) and their configuration can have a huge impact. A single hard drive disk with 7,200 RPM can handle about 60 transactions per second (ball park figure depending on many factors).
So my suggestion would be to first measure and find out where the bottleneck is. Chances are you don't even need threads. That would make the code substantially easier to maintain, and in all likelihood the quality would be much higher.
"How can I synchronize my object contexts in each thread ?"
This is going to be tough. First of all SP or the DB queries can have parallel execution plan. So if you also have parallelism on object context you have to manually make sure that you have sufficient isolation but just enough that you dont hold lock too long that you cause deadlocks.
So I would say dont need to do it .
But that might not be the answer you want. So Can you explain a bit more what you want to achieve using this mutithreading. Is it more compute bound or IO bound. If it is IO bound long running ops then look at APM by Jeff Richter.
I think your question is more about synchronization between threads, and EF is irrelevant here. If I understand correctly, you want to notify threads from one group when the main thread performs some operation - in this case the "SaveChanges()" operation. The threads here are like client-server applications: one thread is a server, the other threads are clients, and you want the client threads to react to server activity.
As someone noticed, you probably do not need threads, but let's leave it as it is.
There is no fear of deadlocks as long as you use a separate OC per thread.
I also assume that your client threads are long-running threads with some kind of loop. If you want your code to be executed on the client thread, you can't use C# events.
class ClientThread
{
    public volatile bool SomethingHasChanged;

    public void MainLoop()
    {
        while (true)
        {
            if (SomethingHasChanged)
            {
                refresh(); // e.g. create a fresh OC and reload the data
                SomethingHasChanged = false;
            }
            // your business logic here
        }
    }
}
Now the question is: how will you set the flag in all your client threads? You could keep references to the client threads in your main thread, loop through them, and set all the flags to true.
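For example (a sketch; clientThreads is an assumed collection of the ClientThread objects the main thread created):

// Main thread, right after SaveChanges() succeeds: tell every client
// thread to refresh on its next pass through its loop.
foreach (ClientThread client in clientThreads)
{
    client.SomethingHasChanged = true;
}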
Back when I used EF, I simply had one ObjectContext, to which I synchronized all access.
This isn't ideal: your database layer is effectively single-threaded. But it did keep it thread-safe in a multithreaded environment. In my case, the heavy computation was not in the database code at all - this was a game server, so game logic was of course the primary resource hog - and I didn't have any particular need for a multithreaded DB layer.

Do you know a Bulked/Batched Flows Library for C#

I am working on a project with peak performance requirements, so we need to bulk (batch?) several operations (for example, persisting the data to a database) for efficiency.
However, I want our code to maintain an easy to understand flow, like:
input = Read();
parsed = Parse(input);
if (parsed.Count > 10)
{
    status = Persist(parsed);
    ReportSuccess(status);
    return;
}
ReportFailure();
The feature I'm looking for here is automatically have Persist() happen in bulks (and ergo asynchronously), but behave to its user as if it's synchronous (user should block until the bulk action completes). I want the implementor to be able to implement Persist(ICollection).
I looked into flow-based programming, with which I am not highly familiar. I saw one library for fbp in C# here, and played a bit with Microsoft's Workflow Foundation, but my impression is that both are overkill for what I need. What would you use to implement a bulked flow behavior?
Note that I would like to get code that is exactly like what I wrote (simple to understand & debug), so solutions that involve yield or configuration in order to connect flows to one another are inadequate for my purpose. Also, chaining is not what I'm looking for - I don't want to first build a chain and then run it; I want code that looks like a simple flow ("Do A, Do B, if C then do D").
Common problem - instead of calling Persist directly, I usually load up commands (or something along those lines) into a Persistor class, and then, after the loop is finished, I call Persistor.Persist to persist the batch.
Just a few pointers: if you're generating SQL, the commands you add to the persistor can represent your queries somehow (with built-in objects, custom objects, or just query strings). If you're calling stored procedures, you can use the commands to append stuff to a piece of XML that will be passed down to the SP when you call the persist method.
Hope it helps - pretty sure there's a pattern for this, but I don't know its name :)
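There is no single canonical name for it, but a minimal Persistor along those lines could look like this (the string-based command representation is just one of the options mentioned above, and Execute is a placeholder):

using System.Collections.Generic;

class Persistor
{
    private readonly List<string> _commands = new List<string>();

    // Queue a command instead of executing it immediately.
    public void Add(string command)
    {
        _commands.Add(command);
    }

    // One round-trip for the whole batch, called after the loop finishes.
    public void Persist()
    {
        // e.g. wrap the batch in a transaction, or append the commands
        // to an XML document handed to a stored procedure.
        foreach (var command in _commands)
            Execute(command); // placeholder for the real DB access
        _commands.Clear();
    }

    private void Execute(string command) { /* DB call goes here */ }
}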
I don't know if this is what you need, because it's SQL Server based, but have you tried taking a look at SSIS and/or DTS?
One simple thing you can do is create a MemoryBuffer to which you push the messages; it simply adds them to a list and returns. The MemoryBuffer has a System.Timers.Timer which fires periodically and performs the "actual" updates.
One such implementation can be found in a Syslog Server (C#) at http://www.fantail.net.nz/wordpress/?p=5 in which the syslog messages gets logged to a SQL Server periodically in a batch.
This approach might not be good if the info being pushed to the database is important, because if something goes wrong, you will lose the messages still sitting in the MemoryBuffer.
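A rough sketch of that buffer-plus-timer idea (the interval and the flush target are choices you would tune; WriteBatchToDatabase is a placeholder):

using System.Collections.Generic;
using System.Timers;

class MemoryBuffer
{
    private readonly object _sync = new object();
    private List<string> _messages = new List<string>();
    private readonly Timer _timer;

    public MemoryBuffer(double intervalMs = 5000)
    {
        _timer = new Timer(intervalMs);
        _timer.Elapsed += (s, e) => Flush();
        _timer.Start();
    }

    // Callers just add to the in-memory list and return immediately.
    public void Push(string message)
    {
        lock (_sync) _messages.Add(message);
    }

    private void Flush()
    {
        List<string> batch;
        lock (_sync)
        {
            if (_messages.Count == 0) return;
            batch = _messages;
            _messages = new List<string>();
        }
        // One batched write; if this fails, the batch is lost, which is
        // exactly the trade-off warned about above.
        WriteBatchToDatabase(batch);
    }

    private void WriteBatchToDatabase(List<string> batch) { /* ... */ }
}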
How about using the BackgroundWorker class to persist each item asynchronously on a separate thread? For example:
using System;
using System.Collections;
using System.Collections.Generic;
using System.ComponentModel;
using System.Threading;

class PersistenceManager
{
    public void Persist(ICollection persistable)
    {
        // initialize a list of background workers
        var backgroundWorkers = new List<BackgroundWorker>();

        // launch each persistable item in a background worker on a separate thread
        foreach (var persistableItem in persistable)
        {
            var worker = new BackgroundWorker();
            worker.DoWork += new DoWorkEventHandler(worker_DoWork);
            backgroundWorkers.Add(worker);
            worker.RunWorkerAsync(persistableItem);
        }

        // wait for all the workers to finish
        while (true)
        {
            // sleep a little bit to give the workers a chance to finish
            Thread.Sleep(100);

            // continue looping until all workers are done processing
            if (backgroundWorkers.Exists(w => w.IsBusy)) continue;
            break;
        }

        // dispose all the workers
        foreach (var w in backgroundWorkers) w.Dispose();
    }

    void worker_DoWork(object sender, DoWorkEventArgs e)
    {
        var persistableItem = e.Argument;
        // TODO: add logic here to save the persistableItem to the database
    }
}
