I have a .Net application that utilizes multiple Hangfire servers.
I want one Hangfire RecurringJob to trigger multiple BackgroundJobs that can be picked up by any available server. Currently, whenever I schedule background jobs from a Hangfire job, only the server that scheduled them will process them.
For example, I have 5 Hangfire Servers and 10 tasks.
I would want there to be 2 tasks on each Hangfire server; instead I am seeing 1 server with 10 tasks and 4 servers with 0.
So again, I have 5 Hangfire servers, all using the same database, and 1 RecurringJob; this RecurringJob just reads some files and enqueues several background jobs:
foreach (var file in reportSourceSetFileList)
{
    _logger.LogInformation($"Queuing Background job for: {file}");
    var backgroundJobId = BackgroundJob.Enqueue<IJobHandler>(job => job.ProcessFile(file, files, null));
}
However, only the Hangfire Server that ran the RecurringJob will process the Enqueued job.
How can I have those Enqueued jobs be processed by any of my 5 Hangfire Servers and not just the one that queued them?
There is no built-in functionality in Hangfire for round-robin load balancing between multiple Hangfire servers.
My solution was to use the queuing system. When each Hangfire server starts, it is given a task identifier, which is a GUID; I also add a unique queue to that server which uses the same GUID as its name.
So each server will look at 2 queues, Default and GUID.
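For reference, the server registration could look roughly like this (a minimal sketch, not the original code; the option names are Hangfire's, but the GUID handling and variable names are assumptions):

// Sketch: give each server the shared "default" queue plus a per-server queue named after its GUID.
// The "N" format strips hyphens, since older Hangfire versions did not allow hyphens in queue names.
var serverGuid = Guid.NewGuid().ToString("N");

var server = new BackgroundJobServer(new BackgroundJobServerOptions
{
    Queues = new[] { "default", serverGuid }
});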
Then I use the following code to find which server has the least jobs currently processing.
private string GetNextAvailableServer()
{
    var serverJobCounts = new Dictionary<string, int>();
    var monitoringApi = JobStorage.Current.GetMonitoringApi();

    // Get active servers (Heartbeat is nullable and stored in UTC).
    var serverList = monitoringApi.Servers();
    foreach (var server in serverList)
    {
        if (server.Heartbeat.HasValue && server.Heartbeat.Value > DateTime.UtcNow.AddMinutes(-1))
        {
            serverJobCounts.Add(server.Name, 0);

            // Count the jobs waiting in each queue this server watches.
            var currentQueues = monitoringApi.Queues();
            foreach (var queue in server.Queues)
            {
                serverJobCounts[server.Name] += (int?)currentQueues.FirstOrDefault(e => e.Name == queue)?.Length ?? 0;
            }
        }
    }

    // Add the jobs each server is currently processing.
    var jobs = monitoringApi.ProcessingJobs(0, int.MaxValue);
    foreach (var job in jobs)
    {
        if (serverJobCounts.ContainsKey(job.Value.ServerId))
        {
            serverJobCounts[job.Value.ServerId] += 1;
        }
    }

    // Pick the server with the fewest jobs; its name starts with the GUID, which is also its queue name.
    var nextServer = serverJobCounts.OrderBy(e => e.Value).FirstOrDefault().Key;
    return nextServer.Split(':')[0].Replace("-", string.Empty, StringComparison.InvariantCulture);
}
This returns the GUID of the server with the fewest jobs, which is also the name of its queue. Therefore you can enqueue the next job to the specific queue of the server with the fewest jobs currently processing:
var nextServer = GetNextAvailableServer();
var client = new BackgroundJobClient();
var state = new EnqueuedState(nextServer);
var enqueueJob = client.Create<IJobHandler>(job => job.ProcessFile(file, files, null), state);
Additionally, when I wrote this, Hangfire didn't allow hyphens in a queue name, hence my string manipulation to make the GUIDs work. I think the newest version of Hangfire lets you use hyphens in queue names.
One thing to look out for: this solution breaks when one of your servers dies. Since a job is given a unique queue, if the server watching that queue dies before processing the job, it will never be picked up.
Introduction
Hello all, we're currently working on a microservice platform that uses Azure EventHubs and events to send data between the services.
Let's just name these services: CustomerService, OrderService and MobileBFF.
The CustomerService mainly sends updates (with events) which will then be stored by the OrderService and MobileBFF to be able to respond to queries without having to call the CustomerService for this data.
All these 3 services + our developers on the DEV environment make use of the same ConsumerGroup to connect to these event hubs.
We currently make use of only 1 partition but plan to expand to multiple later. (You can see our code is already made to be able to read from multiple partitions)
Exception
Every now and then we run into an exception (once it starts, it usually keeps throwing this error for an hour or so). For now we've only seen this error on DEV/TEST environments.
The exception:
Azure.Messaging.EventHubs.EventHubsException(ConsumerDisconnected): At least one receiver for the endpoint is created with epoch of '0', and so non-epoch receiver is not allowed. Either reconnect with a higher epoch, or make sure all epoch receivers are closed or disconnected.
All consumers of the EventHub store their SequenceNumber in their own database. This allows each consumer to consume events separately and store the last processed SequenceNumber in its own SQL database. When the service (re)starts, it loads the SequenceNumber from the db and then requests events from there onwards until no more events can be found. It then sleeps for 100 ms and retries. Here's the (somewhat simplified) code:
var consumerGroup = EventHubConsumerClient.DefaultConsumerGroupName;
string[] allPartitions = null;

await using (var consumer = new EventHubConsumerClient(consumerGroup, _inboxOptions.EventHubConnectionString, _inboxOptions.EventHubName))
{
    allPartitions = await consumer.GetPartitionIdsAsync(stoppingToken);
}

var allTasks = new List<Task>();
foreach (var partitionId in allPartitions)
{
    //This is required if you reuse variables inside a Task.Run();
    var partitionIdInternal = partitionId;

    allTasks.Add(Task.Run(async () =>
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await using (var consumer = new EventHubConsumerClient(consumerGroup, _inboxOptions.EventHubConnectionString, _inboxOptions.EventHubName))
                //Keep the DI scope (and messageProcessor) alive while events are being processed
                using (var testScope = _serviceProvider.CreateScope())
                {
                    var messageProcessor = testScope.ServiceProvider.GetService<EventHubInboxManager<T, EH>>();

                    //Obtains starting position from the database or sets to "Earliest" or "Latest" based on configuration
                    EventPosition startingPosition = await messageProcessor.GetStartingPosition(_inboxOptions.InboxIdentifier, partitionIdInternal);

                    while (!stoppingToken.IsCancellationRequested)
                    {
                        bool processedSomething = false;

                        await foreach (PartitionEvent partitionEvent in consumer.ReadEventsFromPartitionAsync(partitionIdInternal, startingPosition, stoppingToken))
                        {
                            processedSomething = true;
                            startingPosition = await messageProcessor.Handle(partitionEvent);
                        }

                        if (processedSomething == false)
                        {
                            await Task.Delay(100, stoppingToken);
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                //Log error / delay / retry
            }
        }
    }));
}
The exception is thrown on the following line:
await using (var consumer = new EventHubConsumerClient(consumerGroup, _inboxOptions.EventHubConnectionString, _inboxOptions.EventHubName))
More investigation
The code described above is running in the microservices (which are hosted as App Services in Azure).
In addition, we're also running 1 Azure Function that reads events from the EventHub (probably using the same consumer group).
According to the documentation here: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-features#consumer-groups it should be possible to have up to 5 concurrent readers per partition per consumer group. The recommendation seems to be to have only one active receiver, but it's not clear to us what could happen if we don't follow this guidance.
We did do some tests with manually spawning multiple instances of our service that reads events, and when there were more than 5 this resulted in a different error which stated quite clearly that there could only be 5 consumers per partition per consumer group (or something similar).
Furthermore, it seems (we're not 100% sure) that this issue started happening when we rewrote the code (above) to spawn one thread per partition, even though we only have 1 partition in the EventHub. Edit: we did some more log-digging and also found a few exceptions from before we merged in the code that spawns one thread per partition.
That exception indicates that there is another consumer configured to use the same consumer group and asserting exclusive access over the partition. Unless you're explicitly setting the OwnerLevel property in your client options, the likely candidate is that there is at least one EventProcessorClient running.
To remediate, you can:
Stop any event processors running against the same Event Hub and Consumer Group combination, and ensure that no other consumers are explicitly setting the OwnerLevel.
Run these consumers in a dedicated consumer group; this will allow them to co-exist with the exclusive consumer(s) and/or event processors.
Explicitly set the OwnerLevel to 1 or greater for these consumers; that will assert ownership and force any other consumers in the same consumer group to disconnect (see the sketch after this list).
(note: depending on what the other consumer is, you may need to test different values here. The event processor types use 0, so anything above that will take precedence.)
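For the third option, setting the owner level on the EventHubConsumerClient reads might look roughly like this (a minimal sketch reusing the variable names from the question's code; the value 1 is only an example):

// Sketch only: assert an owner level so this reader takes precedence over
// non-exclusive readers (event processors default to 0) in the same consumer group.
var readOptions = new ReadEventOptions
{
    OwnerLevel = 1
};

await foreach (PartitionEvent partitionEvent in consumer.ReadEventsFromPartitionAsync(
    partitionIdInternal, startingPosition, readOptions, stoppingToken))
{
    // process the event as before
}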
To add to Jesse's answer, I think the exception message comes from the old SDK.
If you look into the docs, there are 3 types of receiving modes defined there:
Epoch
Epoch is a unique identifier (epoch value) that the service uses, to enforce partition/lease ownership.
The epoch feature provides users the ability to ensure that there is only one receiver on a consumer group at any point in time...
Non-epoch:
... There are some scenarios in stream processing where users would like to create multiple receivers on a single consumer group. To support such scenarios, we do have ability to create a receiver without epoch and in this case we allow upto 5 concurrent receivers on the consumer group.
Mixed:
... If there is a receiver already created with epoch e1 and is actively receiving events and a new receiver is created with no epoch, the creation of new receiver will fail. Epoch receivers always take precedence in the system.
I have two Azure Functions. One is HTTP triggered, let's call it the API and the other one ServiceBusQueue triggered, and let's call this one the Listener.
The first one (the API) puts an HTTP request into a queue and the second one (the Listener) picks that up and processes that. The functions SDK version is: 3.0.7.
I have two projects in my solution for this. One which contains the Azure Functions and the other one which has the services. The API once triggered, calls a service from the other project that puts the message into the queue. And the Listener once received a message, calls a service from the service project to process the message.
Any long-running process?
The Listener actually performs a lightweight workflow and it all happens very quickly considering the amount of work it executes. The average time of execution is 90 seconds.
What's the queue specs?
The queue that the Listener listens to and is hosted in an Azure ServiceBus namespace has the following properties set:
Max Delivery Count: 1
Message time to live: 1 day
Auto-delete: Never
Duplicate detection window: 10 min
Message lock duration: 5 min
The API puts the HTTP request into the queue using the following method:
public async Task ProduceAsync(string queueName, string jsonMessage)
{
    jsonMessage.NotNull();
    queueName.NotNull();

    IQueueClient client = new QueueClient(Environment.GetEnvironmentVariable("ServiceBusConnectionString"), queueName, ReceiveMode.PeekLock)
    {
        OperationTimeout = TimeSpan.FromMinutes(5)
    };

    await client.SendAsync(new Message(Encoding.UTF8.GetBytes(jsonMessage)));

    if (!client.IsClosedOrClosing)
    {
        await client.CloseAsync();
    }
}
And the Listener (the Service Bus queue triggered Azure Function) has the following code to process the message:
[FunctionName(nameof(UpdateBookingCalendarListenerFunction))]
public async Task Run([ServiceBusTrigger(ServiceBusConstants.UpdateBookingQueue, Connection = ServiceBusConstants.ConnectionStringKey)] string message)
{
    var data = JsonConvert.DeserializeObject<UpdateBookingCalendarRequest>(message);
    _telemetryClient.TrackTrace($"{nameof(UpdateBookingCalendarListenerFunction)} picked up a message at {DateTime.Now}. Data: {data}");

    await _workflowHandler.HandleAsync(data);
}
The Problem
The Listener function processes the same message 3 times, and I have no idea why! I've Googled and read through a few StackOverflow threads such as this one, and it looks like everyone advises making sure the lock duration is long enough for the process to complete. Although I've set the lock duration to 5 minutes, the problem keeps occurring. I'd really appreciate any help on this.
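For anyone debugging a similar situation, one way to tell genuine broker redeliveries apart from duplicate sends or duplicate handler invocations is to bind the trigger to the full Message and log its MessageId and DeliveryCount. A rough sketch, assuming the same trigger and the Microsoft.Azure.ServiceBus Message type used by the Functions 3.x Service Bus extension:

[FunctionName(nameof(UpdateBookingCalendarListenerFunction))]
public async Task Run([ServiceBusTrigger(ServiceBusConstants.UpdateBookingQueue, Connection = ServiceBusConstants.ConnectionStringKey)] Message message)
{
    // DeliveryCount > 1 means the broker redelivered the message (lock lost or abandoned);
    // DeliveryCount == 1 on every invocation points at duplicate sends or duplicate handler calls.
    _telemetryClient.TrackTrace($"MessageId: {message.MessageId}, DeliveryCount: {message.SystemProperties.DeliveryCount}");

    var data = JsonConvert.DeserializeObject<UpdateBookingCalendarRequest>(Encoding.UTF8.GetString(message.Body));
    await _workflowHandler.HandleAsync(data);
}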
Just adding this in here so it might be helpful for some others.
After some more investigation I realized that, in my particular case, the issue had nothing to do with Azure Functions or Service Bus. In the workflow handler that the UpdateBookingCalendarListenerFunction sends messages to, I was calling some external APIs in parallel, but for reasons unknown (to me) the handler code was calling the external APIs one additional time, regardless of how many records it iterated over. The code below shows how I had implemented the parallel API calls, and the second snippet shows how I changed it to call them one by one, which eventually resolved the issue.
My original code - calling APIs in parallel
public async Task<IEnumerable<StaffMemberGraphApiResponse>> AddAdminsAsync(IEnumerable<UpdateStaffMember> admins, string bookingId)
{
    var apiResults = new List<StaffMemberGraphApiResponse>();
    var adminsToAdd = admins.Where(ad => ad.Action == "add");

    _telemetryClient.TrackTrace($"{nameof(UpdateBookingCalendarWorkflowDetailHandler)} Recognized {adminsToAdd.Count()} admins to add to booking with id: {bookingId}");

    var addAdminsTasks = adminsToAdd.Select(admin => _addStaffGraphApiHandler.HandleAsync(new AddStaffToBookingGraphApiRequest
    {
        BookingId = bookingId,
        DisplayName = admin.DisplayName,
        EmailAddress = admin.EmailAddress,
        Role = StaffMemberAllowedRoles.Admin
    }));

    if (addAdminsTasks.Any())
    {
        var addAdminsTasksResults = await Task.WhenAll(addAdminsTasks);
        apiResults = _populateUpdateStaffMemberResponse.Populate(addAdminsTasksResults, StaffMemberAllowedRoles.Admin).ToList();
    }

    return apiResults;
}
And my new code without aggregating the API calls into the addAdminsTasks object and hence with no await Task.WhenAll(addAdminsTasks):
public async Task<IEnumerable<StaffMemberGraphApiResponse>> AddStaffMembersAsync(IEnumerable<UpdateStaffMember> members, string bookingId, string targetRole)
{
    var apiResults = new List<StaffMemberGraphApiResponse>();

    foreach (var item in members.Where(v => v.Action == "add"))
    {
        _telemetryClient.TrackTrace($"{nameof(UpdateBookingCalendarWorkflowDetailHandler)} Adding {targetRole} to booking: {bookingId}. data: {JsonConvert.SerializeObject(item)}");

        apiResults.Add(_populateUpdateStaffMemberResponse.PopulateAsSingleItem(await _addStaffGraphApiHandler.HandleAsync(new AddStaffToBookingGraphApiRequest
        {
            BookingId = bookingId,
            DisplayName = item.DisplayName,
            EmailAddress = item.EmailAddress,
            Role = targetRole
        }), targetRole));
    }

    return apiResults;
}
I've investigated the first approach and the number of tasks was an exact match for the number of items in the IEnumerable input, yet the API was called one additional time. And within _addStaffGraphApiHandler.HandleAsync there is literally nothing other than an HttpClient object that issues a POST request. Anyway, using the second code resolved the issue.
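A likely explanation (my reading, not the original author's): addAdminsTasks is a lazily evaluated LINQ Select, so calling .Any() on it starts HandleAsync for the first admin, and Task.WhenAll then enumerates the whole sequence again and starts every task a second time, which would produce exactly one extra API call. Materializing the sequence once avoids the double enumeration; a hedged sketch of that variant:

// Sketch: materialize the projection once so .Any() and Task.WhenAll
// operate on the same already-started tasks instead of re-running the Select.
var addAdminsTasks = adminsToAdd.Select(admin => _addStaffGraphApiHandler.HandleAsync(new AddStaffToBookingGraphApiRequest
{
    BookingId = bookingId,
    DisplayName = admin.DisplayName,
    EmailAddress = admin.EmailAddress,
    Role = StaffMemberAllowedRoles.Admin
})).ToList();

if (addAdminsTasks.Any())
{
    var addAdminsTasksResults = await Task.WhenAll(addAdminsTasks);
    apiResults = _populateUpdateStaffMemberResponse.Populate(addAdminsTasksResults, StaffMemberAllowedRoles.Admin).ToList();
}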
I have a Hangfire server set up with several recurring tasks. For local development I don't want these tasks to go through, but I need to be able to trigger them manually through the Hangfire UI.
I am able to pull the Job Data for the currently running job but I don't see anything within it that tells me if it was manually triggered or not.
Here is an excerpt from my code where RunProcessReportsJob is my RecurringJob in Hangfire
public ExitCodeType RunProcessReportsJob(PerformContext context)
{
    var jobId = context.BackgroundJob.Id;
    var connection = JobStorage.Current.GetConnection();
    var jobData = connection.GetJobData(jobId);

    _logger.LogInformation("Reoccurring job disabled.");

    return ExitCodeType.NoError;
}
The jobData has a ton of information about the job and context but again I don't see anything within this that tells me if it is a manually triggered job or a scheduled job.
Hope this helps
private bool JobWasManuallyExecuted(string jobId)
{
    // "Triggered using recurring job manager" -- manually triggered via the UI
    // "Triggered by recurring job scheduler"  -- triggered by the scheduler
    var jobDetails = JobStorage.Current.GetMonitoringApi().JobDetails(jobId);

    if (jobDetails == null)
        return false;

    return jobDetails.History.ToList().Any(x => x.Reason == "Triggered using recurring job manager");
}
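Wired into the recurring job from the question, the check could be used roughly like this (a sketch; the _isLocalDevelopment flag is a hypothetical setting, not from the original code):

public ExitCodeType RunProcessReportsJob(PerformContext context)
{
    var jobId = context.BackgroundJob.Id;

    // Skip scheduled runs during local development, but still allow manual triggers from the dashboard.
    if (_isLocalDevelopment && !JobWasManuallyExecuted(jobId))
    {
        _logger.LogInformation("Reoccurring job disabled.");
        return ExitCodeType.NoError;
    }

    // ... run the actual report processing here ...
    return ExitCodeType.NoError;
}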
The same reason text appears on the dashboard UI as well: jobs executed by the scheduler show "Triggered by recurring job scheduler", while manually executed jobs show "Triggered using recurring job manager".
I have a situation where I need a recurring job registered with hangfire to run on every server in the cluster.
(The job is to copy some files locally so needs to run on every server regularly)
So far I have tried registering the same job with an id of the server name, resulting in n jobs for n servers:
RecurringJob.AddOrUpdate(Environment.MachineName, () => CopyFiles(Environment.MachineName), Cron.MinuteInterval(_delay));
and the job itself checks if it is the correct server and only does something if it is:
public static void CopyFiles(string taskId)
{
    if (string.IsNullOrWhiteSpace(taskId) || !taskId.Equals(Environment.MachineName))
    {
        return;
    }

    // do stuff here if it matches our taskname
}
The problem with this is that every job executes on the first server to come along, is marked as complete, and as a result is not executed by the other servers.
Is there any way to ensure that the job runs on all servers?
or is there a way to ensure that only one server can process a given job? i.e. target the job at the server that created it
Found an answer using this link.
Simply assign the job to a queue that is specific to the server you want it processing on.
So I changed my enqueue to:
RecurringJob.AddOrUpdate(Environment.MachineName,
    () => CopyFiles(Environment.MachineName),
    Cron.MinuteInterval(_delay),
    queue: Environment.MachineName.ToLower(CultureInfo.CurrentCulture));
And when I start my server I do this:
_backgroundJobServer = new BackgroundJobServer(new BackgroundJobServerOptions
{
    Queues = new[] { Environment.MachineName.ToLower() }
});
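One caveat worth noting (not part of the original answer): a server configured this way only watches its machine-name queue, so it will no longer pick up jobs from the shared default queue. If it should keep doing both, list both queues:

// Variant: process both the per-machine queue and the shared "default" queue.
_backgroundJobServer = new BackgroundJobServer(new BackgroundJobServerOptions
{
    Queues = new[] { Environment.MachineName.ToLower(), "default" }
});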
Ok, little bit of background here. I have a large scale web application (MVC3) which does all kinds of unimportant stuff. I need this web application to have the ability to schedule ad-hoc Quartz.NET jobs in an Oracle database. Then, I want the jobs to be executed later on via a windows service. Ideally, I'd like to schedule them to run in even intervals, but with the option to add jobs via the web app.
Basically, the desired architecture is some variation of this:
Web app <--> Quartz.NET <--> Database <--> Quartz.NET <--> Windows Service
What I have coded up so far:
A Windows service which (for now) schedules AND runs the jobs. This obviously isn't going to be the case in the long run, but I'm wondering if I can keep just this and modify it to basically represent both "Quartz.NET" boxes in the diagram above.
The web app (details I guess aren't very important here)
The jobs (which are actually just another windows service)
And a couple important notes:
It HAS to be run from a windows service, and it HAS to be scheduled through the web app (to reduce load on IIS)
The architecture above can be rearranged a little bit, assuming the above bullet still applies.
Now, a few questions:
Is this even possible?
Assuming (1) passes, what do you guys think is the best architecture for this? See first bullet on what I've coded up.
Can somebody maybe give me a few Quartz methods that will help me out with querying the DB for jobs to execute once they're already scheduled?
There will be a bounty on this question as soon as it is eligible. If the question is answered in a satisfactory way before then, I will still award the bounty to the poster of the answer. So, in any case, if you give a good answer here, you'll get a bounty.
I'll try answering your questions in the order you have them.
Yes, it's possible to do this. It's actually a common way of working with Quartz.Net. In fact, you can also write an ASP.Net MVC application that manages Quartz.Net schedulers.
Architecture. Ideally and at a high level, your MVC application will use the Quartz.Net API to talk to a Quartz.Net server that is installed as a windows service somewhere. Quartz.Net uses remoting to communicate remotely, so any limitations of using remoting apply (like it's not supported in Silverlight, etc). Quartz.Net provides a way to install it as a windows service out of the box, so there really isn't much work to be done here, other than configuring the service itself to use (in your case) an AdoJobStore, and also enabling remoting. There is some care to be taken around how to install the service properly, so if you haven't done that yet, take a look at this post.
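As a rough idea of what the service-side configuration might look like (a sketch only: the property names are Quartz.NET's standard AdoJobStore/remoting settings, but the connection string, provider name, port, and delegate type are placeholders you would adjust for your Oracle setup and Quartz.NET version):

// Sketch: configure the Windows-service scheduler to persist jobs in Oracle (AdoJobStore)
// and expose itself over remoting so the MVC app can connect to it.
NameValueCollection properties = new NameValueCollection();
properties["quartz.scheduler.instanceName"] = "RemoteServer";
properties["quartz.threadPool.threadCount"] = "5";

// Job store (persist jobs/triggers in the database)
properties["quartz.jobStore.type"] = "Quartz.Impl.AdoJobStore.JobStoreTX, Quartz";
properties["quartz.jobStore.driverDelegateType"] = "Quartz.Impl.AdoJobStore.OracleDelegate, Quartz";
properties["quartz.jobStore.tablePrefix"] = "QRTZ_";
properties["quartz.jobStore.dataSource"] = "default";
properties["quartz.dataSource.default.connectionString"] = "<your Oracle connection string>";
properties["quartz.dataSource.default.provider"] = "OracleODP-20"; // provider name depends on your Quartz.NET/Oracle client version

// Remoting exporter (lets remote clients such as the MVC app connect)
properties["quartz.scheduler.exporter.type"] = "Quartz.Simpl.RemotingSchedulerExporter, Quartz";
properties["quartz.scheduler.exporter.port"] = "555";
properties["quartz.scheduler.exporter.bindName"] = "QuartzScheduler";
properties["quartz.scheduler.exporter.channelType"] = "tcp";

ISchedulerFactory factory = new StdSchedulerFactory(properties);
IScheduler scheduler = factory.GetScheduler();
scheduler.Start();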
Internally, in your MVC application you'll want to get a reference to the scheduler and store it as a singleton. Then in your code you'll schedule jobs and get information about the scheduler through this unique instance. You could use something like this:
public class QuartzScheduler
{
    private readonly ISchedulerFactory _schedulerFactory;
    private IScheduler _scheduler;

    public QuartzScheduler(string server, int port, string scheduler)
    {
        Address = string.Format("tcp://{0}:{1}/{2}", server, port, scheduler);
        _schedulerFactory = new StdSchedulerFactory(getProperties(Address));

        try
        {
            _scheduler = _schedulerFactory.GetScheduler();
        }
        catch (SchedulerException)
        {
            MessageBox.Show("Unable to connect to the specified server", "Connection Error", MessageBoxButtons.OK, MessageBoxIcon.Exclamation);
        }
    }

    public string Address { get; private set; }

    private NameValueCollection getProperties(string address)
    {
        NameValueCollection properties = new NameValueCollection();
        properties["quartz.scheduler.instanceName"] = "RemoteClient";
        properties["quartz.scheduler.proxy"] = "true";
        properties["quartz.threadPool.threadCount"] = "0";
        properties["quartz.scheduler.proxy.address"] = address;
        return properties;
    }

    public IScheduler GetScheduler()
    {
        return _scheduler;
    }
}
This code sets up your Quartz.Net client. Then to access the remote scheduler, just call
GetScheduler()
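From there, scheduling an ad-hoc job from the web app is a matter of building a job and trigger against that remote scheduler. A minimal sketch (quartzScheduler is assumed to be the singleton instance of the QuartzScheduler class above, and MyAdHocJob is a hypothetical IJob implementation in an assembly the service can also load):

// Sketch: schedule an ad-hoc job through the remote scheduler obtained from QuartzScheduler.
var scheduler = quartzScheduler.GetScheduler();

IJobDetail job = JobBuilder.Create<MyAdHocJob>()
    .WithIdentity("adHocReport-" + Guid.NewGuid(), "webAppJobs")
    .Build();

ITrigger trigger = TriggerBuilder.Create()
    .WithIdentity("adHocReportTrigger-" + Guid.NewGuid(), "webAppJobs")
    .StartAt(DateTimeOffset.UtcNow.AddMinutes(5)) // or .WithSimpleSchedule(...) for an interval
    .Build();

scheduler.ScheduleJob(job, trigger);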
Querying
Here is some sample code to get all the jobs from the scheduler:
public DataTable GetJobs()
{
    DataTable table = new DataTable();
    table.Columns.Add("GroupName");
    table.Columns.Add("JobName");
    table.Columns.Add("JobDescription");
    table.Columns.Add("TriggerName");
    table.Columns.Add("TriggerGroupName");
    table.Columns.Add("TriggerType");
    table.Columns.Add("TriggerState");
    table.Columns.Add("NextFireTime");
    table.Columns.Add("PreviousFireTime");

    var jobGroups = GetScheduler().GetJobGroupNames();
    foreach (string group in jobGroups)
    {
        var groupMatcher = GroupMatcher<JobKey>.GroupContains(group);
        var jobKeys = GetScheduler().GetJobKeys(groupMatcher);

        foreach (var jobKey in jobKeys)
        {
            var detail = GetScheduler().GetJobDetail(jobKey);
            var triggers = GetScheduler().GetTriggersOfJob(jobKey);

            foreach (ITrigger trigger in triggers)
            {
                DataRow row = table.NewRow();
                row["GroupName"] = group;
                row["JobName"] = jobKey.Name;
                row["JobDescription"] = detail.Description;
                row["TriggerName"] = trigger.Key.Name;
                row["TriggerGroupName"] = trigger.Key.Group;
                row["TriggerType"] = trigger.GetType().Name;
                row["TriggerState"] = GetScheduler().GetTriggerState(trigger.Key);

                DateTimeOffset? nextFireTime = trigger.GetNextFireTimeUtc();
                if (nextFireTime.HasValue)
                {
                    row["NextFireTime"] = TimeZone.CurrentTimeZone.ToLocalTime(nextFireTime.Value.DateTime);
                }

                DateTimeOffset? previousFireTime = trigger.GetPreviousFireTimeUtc();
                if (previousFireTime.HasValue)
                {
                    row["PreviousFireTime"] = TimeZone.CurrentTimeZone.ToLocalTime(previousFireTime.Value.DateTime);
                }

                table.Rows.Add(row);
            }
        }
    }

    return table;
}
You can view this code on Github