I have a set of files in an S3 bucket and I'd like to have multiple clients process the files by downloading and deleting them so they can be processed locally.
How can I ensure that only one client can access any single file, so that exactly one worker downloads and processes it? I know I can introduce an additional queuing system or other external process to implement some kind of FIFO queue or locking mechanism, but I'm really hoping to minimize the number of components here so it's simply
(file_generation -> S3 -> workers) without adding more systems to manage or things that might break.
So is there any way to obtain a lock on a file or somehow atomically tag it for a single worker such that other workers will know to ignore it? Perhaps renaming the object's key with the worker's ID so it's "claimed" and no one else will touch it?
Why are you using a file store as a queue? Why not use a queue? (From your question, it sounds like you are being lazy!)
If you want to keep a similar workflow, create a file on S3 and post the URI of the file to the queue (this can be done automatically by AWS).
Queues can have multiple consumers and there will never be any conflicts (normally).
How can I ensure that only one client can access any single file, so that exactly one worker downloads and processes it?
By using a queue, such as an Amazon SQS queue:
Create an SQS queue
Configure the S3 bucket to automatically send a message to the queue when a new object is created
Configure your workers to poll the SQS queue for messages.
When they receive a message, the message is temporarily made 'invisible' but is not removed from the queue
When the worker has completed their process, they delete the message from the SQS queue
It meets your requirements 100% and works "out-of-the-box". Much more reliable than writing your own process.
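To make the steps above concrete, here is a minimal worker sketch. It assumes the AWS SDK for .NET and that the bucket's ObjectCreated notifications are delivered to an SQS queue; the queue URL and file paths are placeholders, not details from the original post.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.SQS;
using Amazon.SQS.Model;

class Worker
{
    static async Task Main()
    {
        var sqs = new AmazonSQSClient();
        var s3 = new AmazonS3Client();
        const string queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/new-files"; // placeholder

        while (true)
        {
            // Long-poll for one message; once received, it becomes invisible
            // to other workers for the queue's visibility timeout.
            var response = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
            {
                QueueUrl = queueUrl,
                MaxNumberOfMessages = 1,
                WaitTimeSeconds = 20
            });

            var messages = response.Messages ?? new List<Message>();
            foreach (var message in messages)
            {
                // The S3 event notification body identifies the bucket and key.
                using var doc = JsonDocument.Parse(message.Body);
                var s3Info = doc.RootElement.GetProperty("Records")[0].GetProperty("s3");
                var bucket = s3Info.GetProperty("bucket").GetProperty("name").GetString();
                var key = s3Info.GetProperty("object").GetProperty("key").GetString();

                // Download the object, process it locally, then clean up.
                using (var obj = await s3.GetObjectAsync(bucket, key))
                {
                    await obj.WriteResponseStreamToFileAsync(Path.GetFileName(key), false, default);
                }
                // ... process the local file here ...

                await s3.DeleteObjectAsync(bucket, key);
                // Delete the message only after successful processing; if the
                // worker dies earlier, the message reappears for another worker.
                await sqs.DeleteMessageAsync(queueUrl, message.ReceiptHandle);
            }
        }
    }
}
```

Because the message is deleted only after the object has been processed and removed, a worker that crashes part-way through simply lets the message become visible again for another worker to retry.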
What if an entry in the queue gets lost and the file remains?
Amazon SQS supports the concept of an invisibility period for messages that are being processed but have not yet been fully processed. If a worker fails to delete the message after processing, the message will reappear on the queue after a defined period, ready for another worker to process it.
Or what if the queue goes offline?
Amazon SQS is a regional service, which means that queues are replicated across multiple Availability Zones and operated by parallel servers.
Or what if objects get renamed?
It is not possible to 'rename' objects in Amazon S3. An object would need to be copied and the original object deleted.
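If you do want to "claim" objects that way, the closest equivalent is a copy followed by a delete, sketched below with the AWS SDK for .NET (the bucket and key names are invented). Note that the two calls are not atomic, which is part of why the queue-based approach above is more reliable than claiming by renaming.

```csharp
using Amazon.S3;

var s3 = new AmazonS3Client();

// "Rename" = copy to the new key, then delete the original (not atomic).
await s3.CopyObjectAsync("my-bucket", "incoming/file.csv",
                         "my-bucket", "claimed/worker-42/file.csv");
await s3.DeleteObjectAsync("my-bucket", "incoming/file.csv");
```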
Related
I am trying to create an application that is just a queue, with worker threads processing messages in the queue. That by itself shouldn't be a problem, but the queue will not originate from the filesystem. What I would like to do is bypass the filesystem entirely and send the data to the application to be added to the queue. I want to bypass the filesystem because I have a scheduler that can run dozens of times per second and will eventually be used to add items to the queue.
How can I send data to this application while it is already running?
Should I call CreateIfNotExistsAsync() before every read/write on Azure queue?
I know it results in a REST call, but does it do any IO on the queue?
I am using the .Net library for Azure Queue (if this info is important).
All that method does is try to create the queue and catch the AlreadyExists error, which you could just as easily replicate yourself by catching the 404 when you try to access the queue. There is bound to be some performance impact.
More importantly, it increases your costs: from the archive of Understanding Windows Azure Storage Billing – Bandwidth, Transactions, and Capacity [MSDN]
We have seen applications that perform a CreateIfNotExist [sic] on a Queue before every put message into that queue. This results in two separate requests to the storage system for every message they want to enqueue, with the create queue failing. Make sure you only create your Blob Containers, Tables and Queues at the start of their lifetime to avoid these extra transaction costs.
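For illustration, here is a minimal sketch assuming the Azure.Storage.Queues package (the same idea applies to the older client the question mentions): create the queue once at startup, then keep the per-message path to a single request. The connection string and queue name are placeholders.

```csharp
using Azure.Storage.Queues;

var queue = new QueueClient("<connection-string>", "work-items"); // placeholder values

// One-time setup (one transaction); harmless if the queue already exists.
await queue.CreateIfNotExistsAsync();

// Hot path: a single transaction per message instead of two.
await queue.SendMessageAsync("process order 42");
```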
I have multiple queues that multiple clients insert messages into.
On the server side, I have multiple micro-services that access the queues and handle those messages. I want to lock a queue whenever a service is working on it, so that other services won't be able to work on that queue.
Meaning that if service A is processing a message from queue X, no other service can process a message from that queue, until service A has finished processing the message. Other services can process messages from any queue other than X.
Does anyone have an idea of how to lock a queue and prevent others from accessing it? Preferably the other services would receive an exception or something so that they'll try again on a different queue.
UPDATE
Another way could be to assign the queues to the services, so that whenever a service is working on a queue no other service is assigned to it until the work item has been processed. This also isn't easy to achieve.
There are several built-in ways of doing this. If you only have a single worker, you can set MessageOptions.MaxConcurrentCalls = 1.
If you have multiple, you can use the Singleton attribute. This gives you the option of setting it in Listener mode or Function mode. The former gives the behavior you're asking for, a serially-processed FIFO queue. The latter lets you lock more granularly, so you can specifically lock around critical sections, ensuring consistency while allowing greater throughput, but doesn't necessarily preserve order.
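As a rough sketch of the Listener-mode option, assuming the Azure WebJobs SDK with the Service Bus extension (the queue name and function are invented for illustration, not from the original post):

```csharp
using Microsoft.Azure.WebJobs;

public class Functions
{
    // Listener mode: only one instance across all hosts listens to this queue,
    // so messages are processed one at a time, in order.
    [Singleton(Mode = SingletonMode.Listener)]
    public static void ProcessQueueMessage(
        [ServiceBusTrigger("queue-x")] string message) // placeholder queue name
    {
        // Work that must not run concurrently for this queue goes here.
    }
}
```

Switching to `SingletonMode.Function` locks only the function invocation itself, which is the more granular option described above.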
My guess is they've implemented the Singleton attribute similarly to your Redis approach, so performance should be equivalent. I've done no testing with that, though.
You can achieve this by using Azure Service Bus message sessions.
All messages in your queue must be tagged with the same SessionId. In that case, when a client receives a message, it locks not only that message but all messages with the same SessionId (effectively the whole queue).
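A hedged sketch of that pattern, assuming the Azure.Messaging.ServiceBus package and a session-enabled queue; the connection string, queue name, and session id are placeholders:

```csharp
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<connection-string>");

// Sender side: tag every message for this logical queue with the same SessionId.
var sender = client.CreateSender("work-items");
await sender.SendMessageAsync(new ServiceBusMessage("job 1") { SessionId = "queue-x" });

// Receiver side: accepting the session locks it, so no other consumer can
// receive messages with that SessionId until the session is released.
ServiceBusSessionReceiver receiver = await client.AcceptSessionAsync("work-items", "queue-x");
ServiceBusReceivedMessage msg = await receiver.ReceiveMessageAsync();
// ... process ...
await receiver.CompleteMessageAsync(msg);
```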
The solution was to use Azure Redis to store the locks in memory and have the micro-services manage those locks through the Redis store.
The lock() and unlock() operations are atomic and the lock has a TTL, so that a queue won't be locked indefinitely.
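Here is a rough sketch of such a lock, assuming StackExchange.Redis; the key name, TTL, and error handling are illustrative only, not the exact implementation described above:

```csharp
using System;
using StackExchange.Redis;

var redis = await ConnectionMultiplexer.ConnectAsync("<azure-redis-connection-string>");
var db = redis.GetDatabase();

string lockKey = "lock:queue-x";
string owner = Guid.NewGuid().ToString(); // identifies this service instance

// Atomic "set if not exists" with a TTL, so a crashed service cannot hold the lock forever.
bool acquired = await db.StringSetAsync(lockKey, owner, TimeSpan.FromSeconds(30), When.NotExists);
if (!acquired)
{
    throw new InvalidOperationException("Queue is locked by another service; try a different queue.");
}
try
{
    // ... process one message from queue X ...
}
finally
{
    // Release only if we still own the lock (check-and-delete done atomically in Lua).
    await db.ScriptEvaluateAsync(
        "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end",
        new RedisKey[] { lockKey }, new RedisValue[] { owner });
}
```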
Azure Service Bus is a broker with competing consumers. You can't get what you're asking for with a general queue that all instances of your service are using.
Put the work items into a relational database. You can still use queues to push work to workers, but the queue items can now be empty: when a worker receives an item, it knows to look in the database instead. The content of the message is disregarded.
That way messages are independent and idempotent. For queueing to work these two properties usually must hold.
That way you can more easily sequence actions that actually are sequential. You can use transactions as well.
Maybe you don't need queues at all. Maybe it is enough to have a fixed number of workers polling the database for work. This loses auto-scaling with queues, though.
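As an illustration of workers polling the database for work, here is a hedged sketch of claiming a work item with SQL Server locking hints via Microsoft.Data.SqlClient; the table, columns, and status values are invented for the example:

```csharp
using System;
using Microsoft.Data.SqlClient;

const string claimSql = @"
    UPDATE TOP (1) WorkItems WITH (ROWLOCK, READPAST, UPDLOCK)
    SET Status = 'InProgress', ClaimedBy = @worker, ClaimedAt = SYSUTCDATETIME()
    OUTPUT inserted.Id, inserted.Payload
    WHERE Status = 'Pending';";

using var conn = new SqlConnection("<connection-string>");
await conn.OpenAsync();

using var cmd = new SqlCommand(claimSql, conn);
cmd.Parameters.AddWithValue("@worker", Environment.MachineName);

// READPAST skips rows locked by other workers, so two workers never claim the same item.
using var reader = await cmd.ExecuteReaderAsync();
if (await reader.ReadAsync())
{
    long id = reader.GetInt64(0);
    string payload = reader.GetString(1);
    // ... process the work item, then mark it Done in a follow-up UPDATE ...
}
```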
I am working on a command-processing application that uses an Azure Service Bus queue.
Commands are issued from a website and posted to the queue, and the queue messages are processed by a worker role. Processing involves fetching data from the database and other sources based on the queue message values and sending it to different topics. The flow is:
Receive the message
Process the message
Mark the message as complete, or abandon the message on a processing exception.
The challenge I face here is the processing time. Sometimes it exceeds the maximum message lock period (5 minutes, as configured), so the message is unlocked and reappears for a worker role to pick up (there are multiple instances of the worker role). This causes the same message to be processed again.
What options do I have to handle such a scenario?
I have thought about:
Receive the message, add it to a local variable, and mark the message as complete. In case of an exception, send the message again to the queue or to a separate queue (say, a failed-message queue). A second queue also means another worker role to process it.
The processing includes a foreach loop, so I thought of using Parallel.ForEach instead, but I'm not sure how much time that would save, and I have also read some posts about issues when using Parallel in Azure.
Suggestions and fixes are welcome.
Aravind, you can absolutely use a Service Bus queue in this scenario. With the latest SDK you can renew the lock on your message for as long as you are still processing it. Details are at: http://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.brokeredmessage.renewlock.aspx
This is similar to the Azure storage queue functionality of updating the visibility timeout: http://msdn.microsoft.com/en-us/library/windowsazure/microsoft.windowsazure.storage.queue.cloudqueue.updatemessage.aspx
You may want to consider using an Azure Queue; the maximum lease time for an Azure Queue message is 7 days, as opposed to the Azure Service Bus Queue lease time of 5 minutes.
This msdn article describes the differences between the two Azure queue types.
If the standard Azure Queue doesn't contain all the features you need you might consider using both types of Queue.
You can fire off a Task with a heartbeat operation that keeps renewing the lock for you while you're processing the message. This is exactly what I do. I described my approach in Creating a Task with a heartbeat.
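Here is a rough sketch of that idea using the current Azure.Messaging.ServiceBus package rather than the older BrokeredMessage API referenced above; the queue name and renewal interval are placeholders:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

await using var client = new ServiceBusClient("<connection-string>");
var receiver = client.CreateReceiver("commands"); // placeholder queue name

ServiceBusReceivedMessage message = await receiver.ReceiveMessageAsync();

using var cts = new CancellationTokenSource();

// Heartbeat: renew the lock well before the 5-minute lock duration expires.
var heartbeat = Task.Run(async () =>
{
    while (!cts.Token.IsCancellationRequested)
    {
        await Task.Delay(TimeSpan.FromMinutes(4), cts.Token);
        await receiver.RenewMessageLockAsync(message);
    }
}, cts.Token);

try
{
    // ... long-running processing that may exceed the lock duration ...
    await receiver.CompleteMessageAsync(message);
}
catch
{
    await receiver.AbandonMessageAsync(message);
    throw;
}
finally
{
    cts.Cancel(); // stop the heartbeat once processing is finished
}
```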
The project that I'm working on uses a commercially available package to route audio to various destinations. Included with this package is a separate application that can be used to log events generated by the audio routing software to a database, e.g. connecting device 1 to device 3.
I have been tasked with writing an application that reacts to specific events generated by the audio routing software such as reacting to any connections to device 3.
I have noted that the audio routing software uses MSMQ to post event information to the event recorder. This means that event data can build up if the recorder software has not run for a while.
I have located the queue - ".\private$\AudioLog" and would like to perform the following actions:
Detect and process new messages as they are entered onto the queue.
Allow the current event recording software to continue to work as before - therefore messages cannot be removed by my application.
Ensure that I always get to see a message.
Now I note that I can use MessageQueue to Peek at the queue in order to read messages without deletion and also GetAllMessages() to peek at all messages not removed by the event recorder.
If the recording software isn't connected, I can gather message data easily enough, but I can't see how to ensure that I get to see a message before the recorder removes it when it is connected.
Ideally I would like to add my application as a second destination for the message queue. Is this possible programmatically?
If not, since I have administrator privileges and access to the machine with the queue, is it possible to configure the queue manually to branch a second copy of the queue to which I can connect my software?
MSMQ has a journaling feature. You can configure the queue to have a journal; then every message that is removed from the queue (by a read operation) is moved to the journal queue rather than deleted. You can then read (or peek) from the journal. If you use the peek operation, make sure that you have a job that cleans out the journal from time to time.
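A hedged sketch with System.Messaging, using the queue path from the question. The "\Journal$" suffix is my assumption about the standard MSMQ path syntax for a queue's journal, and the message formatter depends on what the routing software actually sends, so treat both as illustrative.

```csharp
using System;
using System.Messaging;

// Enable journaling so messages removed by the recorder are copied to the journal.
using (var queue = new MessageQueue(@".\private$\AudioLog"))
{
    queue.UseJournalQueue = true;
}

// The journal of a private queue is addressed by appending "\Journal$" to its path.
using (var journal = new MessageQueue(@".\private$\AudioLog\Journal$"))
{
    // Assumes string message bodies; adjust to the actual message format.
    journal.Formatter = new XmlMessageFormatter(new[] { typeof(string) });

    // Receiving from the journal drains the journal (so it doesn't grow forever)
    // but does not touch the live queue that the recorder reads from.
    while (true)
    {
        Message message = journal.Receive(); // blocks until a journal entry arrives
        Console.WriteLine(message.Body);
    }
}
```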