I have a Windows service that checks for work every 5 seconds. It uses a System.Threading.Timer to handle the check and the processing, and Monitor.TryEnter to make sure only one thread is checking for work.
Just assume it has to be this way: the following code is part of 8 other workers that are created by the service, and each worker has its own specific type of work it needs to check for.
readonly object _workCheckLocker = new object();

public Timer PollingTimer { get; private set; }

void InitializeTimer()
{
    if (PollingTimer == null)
        PollingTimer = new Timer(PollingTimerCallback, null, 0, 5000);
    else
        PollingTimer.Change(0, 5000);

    Details.TimerIsRunning = true;
}

void PollingTimerCallback(object state)
{
    if (!Details.StillGettingWork)
    {
        if (Monitor.TryEnter(_workCheckLocker, 500))
        {
            try
            {
                CheckForWork();
            }
            catch (Exception ex)
            {
                Log.Error(EnvironmentName + " -- CheckForWork failed. " + ex);
            }
            finally
            {
                Monitor.Exit(_workCheckLocker);
                Details.StillGettingWork = false;
            }
        }
    }
    else
    {
        Log.Standard("Continuing to get work.");
    }
}

void CheckForWork()
{
    Details.StillGettingWork = true;
    // Hit web server to grab work.
    // Log processing.
    // Process work.
}
Now here's the problem:
The code above allows 2 Timer threads to get into the CheckForWork() method. I honestly don't understand how this is possible, but I have experienced it with multiple clients where this software is running.
The logs I got today when I pushed some work showed that it checked for work twice, and I had 2 threads independently trying to process it, which kept causing the work to fail.
Processing 0-3978DF84-EB3E-47F4-8E78-E41E3BD0880E.xml for Update Request. - at 09/14 10:15:501255801
Stopping environments for Update request - at 09/14 10:15:501255801
Processing 0-3978DF84-EB3E-47F4-8E78-E41E3BD0880E.xml for Update Request. - at 09/14 10:15:501255801
Unloaded AppDomain - at 09/14 10:15:501255801
Stopping environments for Update request - at 09/14 10:15:501255801
AppDomain is already unloaded - at 09/14 10:15:501255801
=== Starting Update Process === - at 09/14 10:15:513756009
Downloading File X - at 09/14 10:15:525631183
Downloading File Y - at 09/14 10:15:525631183
=== Starting Update Process === - at 09/14 10:15:525787359
Downloading File X - at 09/14 10:15:525787359
Downloading File Y - at 09/14 10:15:525787359
The logs are written asynchronously and are queued, so don't dig too deep into the fact that the times match exactly; I just wanted to point out what I saw in the logs to show that 2 threads hit a section of code that I believe should never have been allowed. (The log and times are real, though; only the messages are sanitized.)
Eventually the 2 threads start downloading a file big enough that one of them gets access denied on the file, causing the whole update to fail.
How can the above code actually allow this? I experienced this problem last year when I had a lock instead of Monitor, and assumed it was because the Timer eventually got offset enough, due to the lock blocking, that timer threads stacked up: one blocked for 5 seconds and went through right as the Timer triggered another callback, and they both somehow made it in. That's why I went with the Monitor.TryEnter option, so I wouldn't just keep stacking timer threads.
Any clue? In all the cases where I have tried to solve this issue before, the System.Threading.Timer has been the one constant, and I think it's the root cause, but I don't understand why.
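For context, the non-stacking shape I was aiming for is a one-shot timer that only re-arms itself after the work completes, so two callbacks can never be in flight at once. A minimal sketch (illustrative names, not my actual code):

using System;
using System.Threading;

// Sketch only: a one-shot timer re-armed after each run, so callbacks
// can never overlap, no matter how long the work takes.
class NonReentrantPoller
{
    private readonly Timer _timer;

    public NonReentrantPoller()
    {
        // Fire once after 5 seconds; Timeout.Infinite disables the periodic interval.
        _timer = new Timer(Callback, null, 5000, Timeout.Infinite);
    }

    private void Callback(object state)
    {
        try
        {
            // check for and process work here
        }
        finally
        {
            // Re-arm a single shot only after the work has finished.
            _timer.Change(5000, Timeout.Infinite);
        }
    }
}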
I can see in the log you've provided that you got an AppDomain restart there, is that correct? If so, are you sure that you have one and only one object for your service during the AppDomain restart? I think that during the restart not all the threads are stopped at exactly the same time, and some of them could proceed with polling the work queue, so two different threads in different AppDomains got the same Id for the work.
You could probably fix this by marking your _workCheckLocker with the static keyword, like this:
static object _workCheckLocker;
and introducing a static constructor for your class to initialize this field (with inline initialization you could face some more complicated problems). But I'm not sure this would be enough in your case: during an AppDomain restart the static class will be reloaded too, so, as I understand it, this is not an option for you.
Maybe you could introduce a static dictionary for your workers instead of a plain object, so you can check the Ids of the documents being processed.
Another approach is to handle the Stopping event for your service, which is probably raised during the AppDomain restart; there you can introduce a CancellationToken and use it to stop all the work in such circumstances.
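A rough sketch of that idea, assuming a ServiceBase-derived service (all names here are illustrative):

using System.ServiceProcess;
using System.Threading;

// Sketch only: signal a CancellationToken from OnStop so workers can
// bail out while the service (or its AppDomain) is going down.
class MyService : ServiceBase
{
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();

    protected override void OnStop()
    {
        _cts.Cancel(); // workers observe this token and stop polling
        base.OnStop();
    }

    private void PollingTimerCallback(object state)
    {
        if (_cts.IsCancellationRequested)
            return; // shutting down: do not pick up new work

        // ... existing work-checking logic ...
    }
}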
Also, as #fernando.reyes said, you could introduce a heavy lock structure called a mutex for synchronization, but this will degrade your performance.
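For completeness, the mutex variant would look roughly like this. A named Mutex is the interesting kind here, because it is visible across AppDomains and even processes, which is exactly the property a plain lock object lacks (the name below is illustrative):

using System;
using System.Threading;

// Sketch only: a named Mutex spans AppDomain and process boundaries,
// unlike a CLR lock object or Monitor.
void CheckForWorkExclusive()
{
    using (var mutex = new Mutex(false, @"Global\MyService.WorkCheck"))
    {
        if (!mutex.WaitOne(500))
            return; // someone else is already checking for work

        try
        {
            // CheckForWork() would go here.
        }
        finally
        {
            mutex.ReleaseMutex();
        }
    }
}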
TL;DR
The production stored procedure had not been updated in years. Workers were getting work they should never have gotten, so multiple workers were processing update requests.
I was finally able to find the time to properly set myself up locally to act as a production client through Visual Studio. Although I wasn't able to reproduce it the way I've experienced it, I did accidentally stumble upon the issue.
Those who assumed that multiple workers were picking up the work were indeed correct, and that's something that should never have been able to happen, as each worker is unique in the work it does and requests.
It turns out that in our production environment, the stored procedure to retrieve work based on the work type had not been updated through years (yes, years!) of deploys. Anything that checked for work automatically got updates, which meant that when the Update worker and worker Foo checked at the same time, they both ended up with the same work.
Thankfully, the fix is database-side and not a client update.
Related
I am using Azure Service Bus and I have the code below (C#, .NET Core 3.1). I constantly get the error "The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue, or was received by a different receiver instance." when I call CompleteAsync.
As you can see in the code, I have ReceiveMode.PeekLock, AutoComplete = false, and MaxAutoRenewDuration set to 5 minutes. The code that handles the message completes in less than 1 second, and I still get that error every single time.
What drove me crazy is that after hours of reading posts, rewriting my code, and a lot of trial and error, I decided to increase MaxConcurrentCalls from 1 to 2, and magically the error disappeared.
Does anybody know what is going on here?
public void OpenQueue(string queueName)
{
    var messageHandlerOptions = new MessageHandlerOptions(exceptionReceivedEventArgs =>
    {
        Log.Error($"Message handler encountered an exception {exceptionReceivedEventArgs.Exception}.");
        return Task.CompletedTask;
    });
    messageHandlerOptions.MaxConcurrentCalls = 1;
    messageHandlerOptions.AutoComplete = false;
    messageHandlerOptions.MaxAutoRenewDuration = TimeSpan.FromSeconds(300);

    messageReceiver = queueManagers.OpenReceiver(queueName, ReceiveMode.PeekLock);
    messageReceiver.RegisterMessageHandler(async (message, token) =>
    {
        if (await ProcessMessage(message)) // really quick operation, less than 1 second
        {
            await messageReceiver.CompleteAsync(message.SystemProperties.LockToken);
        }
        else
        {
            await messageReceiver.AbandonAsync(message.SystemProperties.LockToken);
        }
    }, messageHandlerOptions);
}
I decided to increase the MaxConcurrentCalls from 1 to 2 and magically the error disappeared.
Concurrency and lock duration are not the only variables in the equation. This sounds like a prefetch issue. If enabled, more messages are prefetched than are processed, to save on latency and round trips. If the prefetch is too aggressive, messages that are prefetched and waiting still have to be processed, and while the processing alone would normally be short enough, the combined time of waiting for processing plus the actual processing can exceed the lock duration.
I would suggest that you:
Increase MaxLockDuration on the queue
Validate the prefetch count
Regarding MaxLockDuration vs MaxAutoRenewDuration: these two are tricky. While the first is guaranteed, the second is not; it is a best effort by the client.
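To validate (or rule out) prefetch, something along these lines, reusing the receiver from the question; PrefetchCount = 0 disables prefetch, so a message's lock clock only starts when the message is actually delivered to the handler:

// Sketch only: inspect and disable prefetch on the receiver from the question.
messageReceiver = queueManagers.OpenReceiver(queueName, ReceiveMode.PeekLock);
Console.WriteLine($"Current prefetch count: {messageReceiver.PrefetchCount}");
messageReceiver.PrefetchCount = 0; // no prefetch: lock duration starts at delivery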
I'm writing the solution for my problem as it may help others.
It turns out the root cause of the problem was a quite basic mistake, but the error got me really confused.
The method OpenQueue was called more than once on the same class instance (a multiple-queues scenario), which was a mistake. The behavior was quite weird: queueManagers appeared to register all the queues as expected, but the lock token got overwritten, causing it to always be invalid.
When I wrote:
I decided to increase the MaxConcurrentCalls from 1 to 2 and magically the error disappeared.
That statement later proved to be incorrect. When I enabled multiple queues, it failed miserably.
The block of code I posted here actually works; what was around it was broken. I was trying to save some time and ended up writing bad code. I fixed my design to manage things properly and everything now runs smoothly.
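In case it helps anyone, the corrected shape is roughly one receiver per queue, with each handler capturing its own receiver, so lock tokens are always completed on the receiver that produced them. A simplified sketch, not my exact code:

// Sketch only: one MessageReceiver per queue (types from
// Microsoft.Azure.ServiceBus and Microsoft.Azure.ServiceBus.Core).
private readonly Dictionary<string, IMessageReceiver> receivers =
    new Dictionary<string, IMessageReceiver>();

public void OpenQueue(string queueName)
{
    var receiver = queueManagers.OpenReceiver(queueName, ReceiveMode.PeekLock);
    receivers[queueName] = receiver;

    var options = new MessageHandlerOptions(args => Task.CompletedTask)
    {
        MaxConcurrentCalls = 1,
        AutoComplete = false
    };

    receiver.RegisterMessageHandler(async (message, token) =>
    {
        if (await ProcessMessage(message))
            await receiver.CompleteAsync(message.SystemProperties.LockToken); // this queue's receiver
        else
            await receiver.AbandonAsync(message.SystemProperties.LockToken);
    }, options);
}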
I'm developing an app which basically performs some tasks on a timer tick (in this case, searching for beacons) and sends the results to the server. My goal was to create an app which does its job constantly in the background.
Fortunately, I'm using logging all over the code, so when we started to test it we found that some time later the timer's callback wasn't being called on time. There were pauses which had obviously been caused by standby and Doze mode. At that moment I was using a background service and System.Threading.Timer. Then, after some research, I rewrote the service to use AlarmManager + wake locks, but the pauses were still there. The next try was to make the service foreground and use a Handler to post delayed tasks, and everything seemed fine while the device was connected to the computer. When the device is not connected to a charger, those pauses are there again. The interesting thing is that we cannot actually predict this behavior: sometimes it works perfectly fine, and sometimes not. And this is really strange, because the code that schedules it is pretty simple and straightforward:
...
private int scanThreadsCount = 0;
private Android.OS.Handler handler = new Android.OS.Handler();

private bool LocationInProgress
{
    get { return Interlocked.CompareExchange(ref scanThreadsCount, 0, 0) != 0; }
}

public void ForceLocation()
{
    if (!LocationInProgress) DoLocation();
}

private async void DoLocation()
{
    Interlocked.Increment(ref scanThreadsCount);
    Logger.Debug("Location is started");
    try
    {
        // Location...
    }
    catch (Exception e)
    {
        Logger.Error(e, "Location cannot be performed due to an unexpected error");
    }
    finally
    {
        if (LocationInterval > 0)
        {
            // It's here. The location interval is 60 seconds
            // and the service is running in the foreground!
            // But in the screenshot we can see the delay, which
            // sometimes reaches 10 minutes or even more.
            handler.PostDelayed(ForceLocation, LocationInterval * 1000);
        }
        Logger.Debug("Location has been finished");
        Interlocked.Decrement(ref scanThreadsCount);
    }
}
...
Actually it could be OK, but I need the service to do its job strictly on time; the callback is being called with a delay of a few seconds, or even a few minutes, and that's not acceptable.
The Android documentation says that foreground services are not restricted by standby and Doze mode, but I cannot really find the cause of this strange behavior. Why is the callback not being called on time? Where do these 10-minute pauses come from? It's pretty frustrating, because I cannot move further until I have a robust basis. Does anybody know the reason for such strange behavior, or have any suggestions on how I can get the callback executed on time?
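For reference, the AlarmManager + wake locks variant I mentioned looked roughly like this (BeaconAlarmReceiver is an illustrative BroadcastReceiver, not my exact code). Worth noting: the Doze documentation caps allow-while-idle alarms at roughly one per 9 minutes per app, which is suspiciously close to the 10-minute gaps:

using Android.App;
using Android.Content;

// Sketch only: schedule an exact alarm that is allowed to fire in Doze.
void ScheduleNextScan(Context context, long intervalMs)
{
    var alarmManager = (AlarmManager)context.GetSystemService(Context.AlarmService);
    var intent = new Intent(context, typeof(BeaconAlarmReceiver));
    var pending = PendingIntent.GetBroadcast(context, 0, intent, PendingIntentFlags.UpdateCurrent);

    // Even SetExactAndAllowWhileIdle fires at most about once per 9 minutes
    // per app while the device is dozing.
    alarmManager.SetExactAndAllowWhileIdle(
        AlarmType.RtcWakeup,
        Java.Lang.JavaSystem.CurrentTimeMillis() + intervalMs,
        pending);
}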
P.S. The current version of the app is here. I know it's quite boring trying to figure out what is wrong with someone else's code, but there are only 3 files that have to do with this problem:
~/Services/BeaconService.cs
~/Services/BeaconServiceScanFunctionality.cs
~/Services/BeaconServiceSyncFunctionality.cs
The project is provided for those who would like to try it in action and figure it out for themselves.
Any help will be appreciated!
Thanks in advance
I wrote an API that automates a certain website. However, during the testing stage, I noticed that my thread is apparently not being terminated correctly (I'm not entirely sure).
I am using the WebBrowser object to navigate inside a thread, so that it works synchronously with my program:
private void NavigateThroughTread(string url)
{
    Console.WriteLine("Defining thread...");
    var th = new Thread(() =>
    {
        _wb = new WebBrowser();
        _wb.DocumentCompleted += PageLoaded;
        _wb.Visible = true;
        _wb.Navigate(url);
        Console.WriteLine("Web browser navigated.");
        Application.Run();
    });
    Console.WriteLine("Thread defined.");
    th.SetApartmentState(ApartmentState.STA);
    Console.WriteLine("Before thread start...");
    th.Start();
    Console.WriteLine("Thread started.");
    while (th.IsAlive) { }
    Console.WriteLine("Journey ends.");
}

private void PageLoaded(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    Console.WriteLine("Pages loads...");
    ...
    switch (_action)
    {
        ...
        case ENUM.FarmActions.Idle:
            _wb.Navigate(new Uri("about:blank"));
            _action = ENUM.FarmActions.Exit;
            return;
        case ENUM.FarmActions.Exit:
            Console.WriteLine("Disposing wb...");
            _wb.DocumentCompleted -= PageLoaded;
            _wb.Dispose();
            break;
    }
    Application.ExitThread(); // Stops the thread
}
Here is how I call this function:
public int Attack(int x, int y, ArmyBuilder army)
{
    // instruct to attack the village
    _action = ENUM.FarmActions.Attack;

    // get the army and coordinates
    _army = army;
    _enemyCoordinates[X] = x;
    _enemyCoordinates[Y] = y;

    // place the attack command
    _errorFlag = true;   // the action is not completed; the flag is set to false once the action completes
    _attackFlag = false; // attack is not made yet
    Console.WriteLine("Journey starts");
    NavigateThroughTread(_url.GetUrl(ENUM.Screens.RallyPoint));
    return _errorFlag ? -1 : CalculateDistance();
}
So the problem is, when I call the Attack function a couple of times like this:
_command.Attack(509, 355, new ArmyBuilder(testArmy_lc));
_command.Attack(509, 354, new ArmyBuilder(testArmy_lc));
_command.Attack(505, 356, new ArmyBuilder(testArmy_lc));
_command.Attack(504, 356, new ArmyBuilder(testArmy_lc));
_command.Attack(504, 359, new ArmyBuilder(testArmy_lc));
_command.Attack(505, 356, new ArmyBuilder(testArmy_lc));
_command.Attack(504, 356, new ArmyBuilder(testArmy_lc));
_command.Attack(504, 359, new ArmyBuilder(testArmy_lc));
My application gets stuck in one of these calls most of the time (it usually happens after the 4th or 5th). When it gets stuck, the last log that I see is
Web browser navigated.
I assume it has something to do with the termination of my thread. Can someone show me how I can run a thread that runs the DocumentCompleted event?
I don't see any obvious reason for deadlock, nor did it reproduce at all when testing the code. There are a number of flaws in the code but nothing that yells "here!" loudly. I can only make recommendations:
Consider that you do not need a thread at all. The while (th.IsAlive) { } hot loop blocks your main thread while you wait for the browser code to finish the job. That is not a useful way to use a thread; you might as well use your main thread. This instantly eliminates a large number of potential hang causes.
The state logic in PageLoaded is risky. We cannot see all of it, but one glaring issue is that you dispose the WebBrowser twice. If you have a case where you return without a Navigate() call, then you'll hang as described. There is no need to unsubscribe the event, but same story: if you do unsubscribe but don't call Application.ExitThread(), then you'll hang as described. State machines can be hard to debug; thorough logging is necessary. Minimize the risk by moving the Dispose() call and the event unsubscription out of the state logic, where they don't belong. And you need to test what happens when any Navigate() call ends in failure, redirecting to a page you did not expect.
The _wb.Dispose() call is risky. Note that you destroy the WebBrowser while its DocumentCompleted event is in flight. Technically, that can return code execution to code that is no longer alive or present. That can trip a race condition in the browser, as well as in the debugger; there is a dedicated MDA that checks for this problem. It is trivially avoided by moving the Dispose() call after the Application.Run() call, where it belongs.
The while loop burns 100% of a core, potentially starving the worker thread. Not a good enough reason to explain a deadlock, but certainly unnecessary. Use Thread.Join() instead.
You create a lot of WebBrowser objects in this code. It is a very heavy object, as you can imagine, so you need to keep an eye on memory usage in your program, especially the unmanaged kind. If the browser leaks, as they so often do, you could technically create a scenario where the WB initializes okay but does not have enough memory left to load the page. Strongly favor using only one WB.
You need to consider that this might well be an environmental problem. At the top of that list, as always, are anti-malware and firewall; they always have a very good reason to treat a browser specially, since it is the most common malware injection vector. You'll need to run your test with anti-malware and firewall disabled to ensure that they are not the cause of the hang.
Another environmental problem is one I noticed while testing this code: Google got sulky about me hitting it so often and started to throttle the requests, greatly slowing down the code. Talk to the web site owner and ask whether there are similar blocking or throttling counter-measures in place; most sites have them. You need to test your state logic to verify that it still works properly when the browser redirects to an error page.
Yet another environmental issue is that the WB will display a dialog by itself in certain cases. This can deadlock in 3rd party code and is very hard to diagnose. You should at least set WebBrowser.ScriptErrorsSuppressed to true, but beware of Javascript code in the web page you load that itself creates new windows or displays alert dialogs. Using one WB is the workaround.
Keep in mind that your program can only be as reliable as your Internet connection and the web server. That's not a terribly good place to be, of course; both are quite out of your reach, and you don't get nice exceptions to help you diagnose such a failure. Also consider that you probably have not yet tested your program enough to know whether it can survive such a failure; it simply doesn't happen often enough.
Quite a laundry list. Focus first on eliminating the unnecessary thread and temporarily disabling anti-malware; that's quick. Focus next on using only one WebBrowser.
Hans, thank you. I was able to fix this issue with one of your ideas. As you spent your time giving me a long answer, I wanted to respond in the same manner.
2 - I built the state machine structure carefully and with a lot of logs (you can see it on my git account) and also did a lot of debugging. I am sure that after I'm done navigating, I call Application.ExitThread() and wb.Dispose() only once.
3 - I tried placing the wb.Dispose() outside the event, however I couldn't find any other place where the thread was still alive. If I try disposing the WebBrowser (which is created inside the thread) from outside that thread, the application gives me an error.
4 - I changed while (th.IsAlive) { } to th.Join(2000). This is definitely a better idea, but it did not change anything. It optimized the code and, as you mentioned, stopped burning 100% of a CPU core.
5 - I tried using a single WebBrowser object instantiated in the constructor. However, when I tried to navigate inside the thread, the application wouldn't even fire the events anymore. For some reason, I couldn't make it run with a single WB object.
6, 7 - I tested my application on different PCs and different networks (with and without firewall protection). I changed the Windows firewall options as well, but to no avail. In my original code I do have _wb.ScriptErrorsSuppressed = true;, so this shouldn't be the issue either.
8, 9 - If these are the reasons, I can't do anything about them. But I doubt the real problem is caused by them.
1 - This one was a good suggestion. I implemented my code without using a thread and it is now working fine. Here is how it looks (it still needs a lot of optimization):
// Constructor
public FarmActions(string token)
{
    // set the urls using the token
    _url = new URL(token);

    // define web browser properties
    _wb = new WebBrowser();
    _wb.DocumentCompleted += PageLoaded;
    _wb.Visible = true;
    _wb.AllowNavigation = true;
    _wb.ScriptErrorsSuppressed = true;
}

public int Attack(int x, int y, ArmyBuilder army)
{
    // instruct to attack the village
    _action = ENUM.FarmActions.Attack;

    // get the army and coordinates
    _army = army;
    _enemyCoordinates[X] = x;
    _enemyCoordinates[Y] = y;

    // place the attack command
    _errorFlag = true;   // the action is not completed; the flag is set to false once the action completes
    _attackFlag = false; // attack is not made yet
    _isAlive = true;
    Console.WriteLine("-------------------------");
    Console.WriteLine("Journey starts");
    NavigateThroughTread(_url.GetUrl(ENUM.Screens.RallyPoint));
    return _errorFlag ? -1 : CalculateDistance();
}

private void NavigateThroughTread(string url)
{
    Console.WriteLine("Defining thread...");
    _wb.Navigate(url);
    while (_isAlive) Application.DoEvents();
}

private void PageLoaded(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    Console.WriteLine("Pages loads...");
    ...
    switch (_action)
    {
        ...
        case ENUM.FarmActions.Idle:
            _wb.Navigate(new Uri("about:blank"));
            _action = ENUM.FarmActions.Exit;
            return;
        case ENUM.FarmActions.Exit:
            break;
    }
    _isAlive = false;
}
This is how I was able to wait without using a thread.
The main problem was probably, as you mentioned, number 3 or 5, but I wasn't able to fix it in the couple of hours I spent on it.
Anyway, thanks for your help, it works.
In my C# application, multiple clients access the same server. The code below was written to process one client at a time. In the code I use the Monitor class and also the Queue class. Will this code affect performance? If I use the Monitor class, should I remove the Queue class from the code?
Sometimes the remote server machine where my application runs as a service goes down completely. Is the code below the reason behind this? All the clients go into a queue, and when I check with the netstat -an command at the command prompt, for 8 clients it shows 50 connections held in TIME_WAIT...
Below is my code where the client accesses the server:
if (Id == "")
{
    System.Threading.Monitor.Enter(this);
    try
    {
        if (Request.AcceptTypes == null)
        {
            queue.Enqueue(Request.QueryString["sessionid"].Value);
            string que = "";
            que = queue.Dequeue();
            TypeController.session_id = que;
            langStr = SessionDatabase.Language;
            filter = new AllThingzFilter(SessionDatabase, parameters, langStr);
            TypeController.session_id = "";
            filter.Execute();
            Request.Clear();
            return filter.XML;
        }
        else
        {
            TypeController.session_id = "";
            filter = new AllThingzFilter(SessionDatabase, parameters, langStr);
            filter.Execute();
        }
    }
    finally
    {
        System.Threading.Monitor.Exit(this);
    }
}
Locking this is pretty wrong; it won't work at all if every thread uses a different instance of whatever class this code lives in. It isn't clear from the snippet whether that's the case, but fix it first. Create a separate object just to store the lock, and make it static or give it the same scope as the shared object you are trying to protect (also not clear from the snippet).
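A minimal sketch of that shape (class and field names are illustrative):

// Sketch only: a dedicated static lock object shared by all instances,
// instead of locking this.
class RequestProcessor
{
    private static readonly object _clientLock = new object();

    public void ProcessOneClient()
    {
        lock (_clientLock)
        {
            // only one client is processed at a time, across all instances
        }
    }
}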
You might still have trouble, since this sounds like a deadlock rather than a race. Deadlocks are pretty easy to troubleshoot with the debugger, since the code gets stuck and is not executing at all. Debug + Break All, then Debug + Windows + Threads. Locate the worker threads in the thread list. Double-click one to select it and use Debug + Call Stack to see where it got stuck. Repeat for the other threads. Look back through the stack trace to see where one of them acquired a lock, and compare with the other threads to see which lock they are blocking on.
That could still be tricky if the deadlock is intricate and involves multiple interleaved locks, in which case logging might help. Really hard-to-diagnose mandelbugs might require a rewrite that cuts back on the amount of threading.
I'm performing a "safe" copy of a directory over another directory as follows:
Given the source C:\Source and target C:\Target
Copy C:\Source to C:\Target-incoming
Move C:\Target (if it exists) to C:\Target-outgoing
Move C:\Target-incoming to C:\Target
Delete C:\Target-outgoing (if it exists)
If any of the first three steps fail, I'll attempt to put things back as they were to prevent data loss.
However, the move of C:\Target-incoming to C:\Target fails with "Access to the path C:\Target-incoming is denied" most of the time.
At the moment, inserting Thread.Sleep(100) just before the move operation fixes the problem for me. However, waiting a tenth of a second seems ridiculous, and Thread.Sleep(10) isn't enough to fix it. I also have the sinking feeling that the wait required depends on the speed of disk IO.
So, my questions:
Can I prevent this from happening?
If not, is there a way of finding out when the lock on the directory is released after copying it?
Edit: For clarity, I'm doing all these operations in one method on one thread, and I'm just using Thread.Sleep() to pause code flow for a moment. The moves and copies are done with the standard .NET Directory.Move(), Directory.CreateDirectory() and File.CopyTo() methods. It would appear that these .NET methods return before the locks on the respective files are released, making it necessary to wait before continuing.
What is probably happening is that your thread is trying to "Move C:\Target-incoming to C:\Target" WHILE the "Move C:\Target to C:\Target-outgoing" is NOT YET finished.
This theory is supported by the fact that your process succeeds after a short Thread.Sleep.
Try to chain your processes: divide each step into a specific method, and call the methods one after the other (synchronizing the start of each method with the end of the previous one).
There are various ways to do that (among others, syncing/locking/chaining different threads per process/step).
You could check Thread Synchronization in .NET.
But of course, this is not the only possible cause of your problem.
After a bunch of testing, it seems like the very act of trying to move a locked folder gets the OS to hurry up and release the lock, even if the first attempt fails.
I wrote this extension method to DirectoryInfo:
public static void TryToMoveTo(this DirectoryInfo o, string targetPath)
{
    int attemptsRemaining = 5;
    while (true)
    {
        try
        {
            o.MoveTo(targetPath);
            break;
        }
        catch (Exception)
        {
            if (attemptsRemaining == 0)
            {
                throw;
            }
            else
            {
                attemptsRemaining--;
                System.Threading.Thread.Sleep(10);
            }
        }
    }
}
While debugging the original problem, I settled on waiting 100 ms, as anything less seemed to cause exceptions (I tried 10, 25, 50, 75 and 100 ms). However, in the method above I wait only 10 ms before retrying, and I never, ever got more than one exception thrown in each of my hundreds of test runs.
You can always try waiting in a loop, up to a maximum number of tries. You can check whether the directory is locked by calling CreateFile and checking its return code. Be sure to read through the "flags" section of the docs, because you need to pass in a special flag to open a directory.
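A sketch of that probe via P/Invoke; FILE_FLAG_BACKUP_SEMANTICS is the special flag needed for directories, and a share mode of 0 requests exclusive access, so the call fails while anything else still holds the directory:

using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Sketch only: probe whether a directory handle can currently be opened.
static class DirectoryProbe
{
    const uint GENERIC_READ = 0x80000000;
    const uint OPEN_EXISTING = 3;
    const uint FILE_FLAG_BACKUP_SEMANTICS = 0x02000000; // required to open a directory

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern SafeFileHandle CreateFile(
        string lpFileName, uint dwDesiredAccess, uint dwShareMode,
        IntPtr lpSecurityAttributes, uint dwCreationDisposition,
        uint dwFlagsAndAttributes, IntPtr hTemplateFile);

    public static bool CanOpen(string directoryPath)
    {
        using (var handle = CreateFile(directoryPath, GENERIC_READ, 0, IntPtr.Zero,
                                       OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, IntPtr.Zero))
        {
            return !handle.IsInvalid; // invalid handle: locked, denied, or missing
        }
    }
}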
Someone else mentioned in a comment that you may want to try Transactional NTFS. If you can, you might want to try that.
Check whether the source and target directories exist before copying or moving, using IO.Directory.Exists.
The access denied error is caused by either the source or the target not being found.