Deferring and re-receiving a deferred message in an IHostBuilder hosted service

Deferring and re-receiving a deferred message in an IHostBuilder hosted service - c#

If the processing of an Azure Service Bus message depends on another resource, e.g. an API or a database service, and this resource is not available, not calling CompleteMessageAsync() is not an option, because the message will be immediately received again until the Max Delivery Count is reached, and then put into the DLQ. If an API is down for maintenance, we want to wait a bit before retrying.
One of the answers to this question has the general steps for deferring and receiving deferred messages. This is a little better than Microsoft's documentation, but not enough for me to understand the intent of the API, and how it is to be implemented in a hosted service that basically sits in ServiceBusProcessor.StartProcessingAsync all day long.
This is the basic structure of my service:
public class ServiceBusWatcher : IHostedService, IDisposable
{
public Task StartAsync(CancellationToken stoppingToken)
{
ReceiveMessagesAsync();
return Task.CompletedTask;
}
private async void ReceiveMessagesAsync()
{
ServiceBusClient client = new ServiceBusClient(connectionString);
processor = client.CreateProcessor(queueName, new ServiceBusProcessorOptions());
processor.ProcessMessageAsync += MessageHandler;
await processor.StartProcessingAsync();
}
async Task MessageHandler(ProcessMessageEventArgs args)
{
// a dependency is not available that allows me to process a message. so:
await args.DeferMessageAsync(args.Message);
Once the message is deferred, it is my understanding that the processor will not get to it anymore (or will it?). Instead, I have to use ReceiveDeferredMessageAsync() to receive it, along with the sequence number of the originally received message.
In my case, it will make sense to wait minutes or hours before trying again.
This could be done with a separate service that uses a timer and an explicit call to ReceiveDeferredMessageAsync(), as opposed to using a ServiceBusProcessor. I also suppose that the deferred message sequence numbers will have to be persisted in non-volatile storage so that they don't get lost.
Does this sound like a viable approach? I don't like having to remember its sequence numbers so that I can get to a message later. It goes against everything that using a message queue brings to the table in the first place.
Or, instead of deferring, I could just post a new "internal" message with the sequence number and use the ScheduledEnqueueTimeUtc property to delay receiving it. Once I receive this message, I could call ReceiveDeferredMessageAsync() with that sequence number to get to the original message. This seems elegant at the surface, but messages could quickly multiply if there is a longer outage of a dependency.
Another idea that could work without another service: I could complete and repost the payload of the message and set ScheduledEnqueueTimeUtc to a time in the future, as described in another answer to the question I mentioned earlier. Assuming that this works (Microsoft's documentation does not mention what this property is for), it seems simple and clean, and I like simple.
How have you solved this? Is there a better/preferred way that balances low complexity with high robustness without requiring a large amount of code?

Deferring a message works when you know what message you want to retrieve later and your receiver will have the message sequence number saved to retrieve the deferred message. If the receiver has no ability to save message sequence number, the delaying the message is a better option. Delaying a message will mean to copy the original message data into a newly scheduled one and completing the original message. That way the consumer doesn't have to neither hold on to the message sequence number nor initiate the retrieval of a specific message.

Related

Azure EventHub: Send Async performance

I have pretty naive code :
public async Task Produce(string topic, object message, MessageHeader messageHeaders)
{
try
{
var producerClient = _EventHubProducerClientFactory.Get(topic);
var eventData = CreateEventData(message, messageHeaders);
messageHeaders.Times?.Add(DateTime.Now);
await producerClient.SendAsync(new EventData[] { eventData });
messageHeaders.Times?.Add(DateTime.Now);
//.....
Log.Info($"Milliseconds spent: {(messageHeaders.Times[1]- messageHeaders.Times[0]).TotalMilliseconds});
}
}
private EventData CreateEventData(object message, MessageHeader messageHeaders)
{
var eventData = new EventData(Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(message)));
eventData.Properties.Add("CorrelationId", messageHeaders.CorrelationId);
if (messageHeaders.DateTime != null)
eventData.Properties.Add("DateTime", messageHeaders.DateTime?.ToString("s"));
if (messageHeaders.Version != null)
eventData.Properties.Add("Version", messageHeaders.Version);
return eventData;
}
in logs I had values for almost 1 second (~ 800 milliseconds)
What could be a reason for such long execution time?

The EventHubProducerClient opens connections to the Event Hubs service lazily, waiting until the first time an operation requires it. In your snippet, the call to SendAsync triggers an AMQP connection to be created, an AMQP link to be created, and authentication to be performed.
Unless the client is closed, most future calls won't incur that overhead as the connection and link are persistent. Most being an important distinction in that statement, as the client may need to reconnect in the face of a network error, when activity is low and the connection idles out, or if the Event Hubs service terminates the connection/link.
As Serkant mentions, if you're looking to understand timings, you'd probably be best served using a library like Benchmark.NET that works ove a large number of iterations to derive statistically meaningful results.

You are measuring the first 'Send'. That will incur some overhead that other Sends won't. So, always do warm up first like send single event and then measure the next one.
Another important thing. It is not right to measure just single 'Send' call. Measure bunch of calls instead and calculate latency percentile. That should provide a better figure for your tests.

Odd Behavior of Azure Service Bus ReceiveBatch()

Working with a Azure Service Bus Topic currently and running into an issue receiving my messages using ReceiveBatch method. The issue is that the expected results are not actually the results that I am getting. Here is the basic code setup, use cases are below:
SubscriptionClient client = SubscriptionClient.CreateFromConnectionString(connectionString, convoTopic, subName);
IEnumerable<BrokeredMessage> messageList = client.ReceiveBatch(100);
foreach (BrokeredMessage message in messageList)
{
try
{
Console.WriteLine(message.GetBody<string>() + message.MessageId);
message.Complete();
}
catch (Exception ex)
{
message.Abandon();
}
}
client.Close();
MessageBox.Show("Done");
Using the above code, if I send 4 messages, then poll on the first run through I get the first message. On the second run through I get the other 3. I'm expecting to get all 4 at the same time. It seems to always return a singular value on the first poll then the rest on subsequent polls. (same result with 3 and 5 where I get n-1 of n messages sent on the second try and 1 message on the first try).
If I have 0 messages to receive, the operation takes between ~30-60 seconds to get the messageList (that has a 0 count). I need this to return instantly.
If I change the code to IEnumerable<BrokeredMessage> messageList = client.ReceiveBatch(100, new Timespan(0,0,0)); then issue #2 goes away because issue 1 still persists where I have to call the code twice to get all the messages.
I'm assuming that issue #2 is because of a default timeout value which I overwrite in #3 (though I find it confusing that if a message is there it immediately responds without waiting the default time). I am not sure why I never receive the full amount of messages in a single ReceiveBatch however.

The way I got ReceiveBatch() to work properly was to do two things.
Disable Partitioning in the Topic (I had to make a new topic for this because you can't toggle that after creation)
Enable Batching on each subscription created like so:
List item
SubscriptionDescription sd = new SubscriptionDescription(topicName, orgSubName);
sd.EnableBatchedOperations = true;
After I did those two things, I was able to get the topics to work as intended using IEnumerable<BrokeredMessage> messageList = client.ReceiveBatch(100, new TimeSpan(0,0,0));

I'm having a similar problem with an ASB Queue. I discovered that I could mitigate it somewhat by increasing the PrefetchCount on the client prior to receiving the batch:
SubscriptionClient client = SubscriptionClient.CreateFromConnectionString(connectionString, convoTopic, subName);
client.PrefetchCount = 100;
IEnumerable<BrokeredMessage> messageList = client.ReceiveBatch(100);
From the Azure Service Bus Best Practices for Performance Improvements Using Service Bus Brokered Messaging:
Prefetching enables the queue or subscription client to load additional messages from the service when it performs a receive operation.
...
When using the default lock expiration of 60 seconds, a good value for
SubscriptionClient.PrefetchCount is 20 times the maximum processing rates of all receivers of the factory. For example, a factory creates 3 receivers, and each receiver can process up to 10 messages per second. The prefetch count should not exceed 20*3*10 = 600.
...
Prefetching messages increases the overall throughput for a queue or subscription because it reduces the overall number of message operations, or round trips. Fetching the first message, however, will take longer (due to the increased message size). Receiving prefetched messages will be faster because these messages have already been downloaded by the client.

Just a few more pieces to the puzzle. I still couldn't get it to work even after Enable Batching and Disable Partitioning - I still had to do two ReceiveBatch calls. I did find however:
Restarting the Service Bus services (I am using Service Bus for Windows Server) cleared up the issue for me.
Doing a single RecieveBatch and taking no action (letting the message locks expire) and then doing another ReceiveBatch caused all of the messages to come through at the same time. (Doing an initial ReceiveBatch and calling Abandon on all of the messages didn't cause that behavior.)
So it appears to be some sort of corruption/bug in Service Bus's in-memory cache.

How to do error handling with EasyNetQ / RabbitMQ

I'm using RabbitMQ in C# with the EasyNetQ library. I'm using a pub/sub pattern here. I still have a few issues that I hope anyone can help me with:
When there's an error while consuming a message, it's automatically moved to an error queue. How can I implement retries (so that it's placed back on the originating queue, and when it fails to process X times, it's moved to a dead letter queue)?
As far as I can see there's always 1 error queue that's used to dump messages from all other queues. How can I have 1 error queue per type, so that each queue has its own associated error queue?
How can I easily retry messages that are in an error queue? I tried Hosepipe, but it justs republishes the messages to the error queue instead of the originating queue. I don't really like this option either because I don't want to be fiddling around in a console. Preferably I'd just program against the error queue.
Anyone?

The problem you are running into with EasyNetQ/RabbitMQ is that it's much more "raw" when compared to other messaging services like SQS or Azure Service Bus/Queues, but I'll do my best to point you in the right direction.
Question 1.
This will be on you to do. The simplest way is that you can No-Ack a message in RabbitMQ/EasyNetQ, and it will be placed at the head of the queue for you to retry. This is not really advisable because it will be retried almost immediately (With no time delay), and will also block other messages from being processed (If you have a single subscriber with a prefetch count of 1).
I've seen other implementations of using a "MessageEnvelope". So a wrapper class that when a message fails, you increment a retry variable on the MessageEnvelope and redeliver the message back onto the queue. YOU would have to do this and write the wrapping code around your message handlers, it would not be a function of EasyNetQ.
Using the above, I've also seen people use envelopes, but allow the message to be dead lettered. Once it's on the dead letter queue, there is another application/worker reading items from the dead letter queue.
All of these approaches above have a small issue in that there isn't really any nice way to have a logarithmic/exponential/any sort of increasing delay in processing the message. You can "hold" the message in code for some time before returning it to the queue, but it's not a nice way around.
Out of all of these options, your own custom application reading the dead letter queue and deciding whether to reroute the message based on an envelope that contains the retry count is probably the best way.
Question 2.
You can specify a dead letter exchange per queue using the advanced API. (https://github.com/EasyNetQ/EasyNetQ/wiki/The-Advanced-API#declaring-queues). However this means you will have to use the advanced API pretty much everywhere as using the simple IBus implementation of subscribe/publish looks for queues that are named based on both the message type and subscriber name. Using a custom declare of queue means you are going to be handling the naming of your queues yourself, which means when you subscribe, you will need to know the name of what you want etc. No more auto subscribing for you!
Question 3
An Error Queue/Dead Letter Queue is just another queue. You can listen to this queue and do what you need to do with it. But there is not really any out of the box solution that sounds like it would fit your needs.

I've implemented exactly what you describe. Here are some tips based on my experience and related to each of your questions.
Q1 (how to retry X times):
For this, you can use IMessage.Body.BasicProperties.Headers. When you consume a message off an error queue, just add a header with a name that you choose. Look for this header on each message that comes into the error queue and increment it. This will give you a running retry count.
It's very important that you have a strategy for what to do when a message exceeds the retry limit of X. You don't want to lose that message. In my case, I write the message to disk at that point. It gives you lots of helpful debugging information to come back to later, because EasyNetQ automatically wraps your originating message with error info. It also has the original message so that you can, if you like, manually (or maybe automated, through some batch re-processing code) requeue the message later in some controlled way.
You can look at the code in the Hosepipe utility to see a good way of doing this. In fact, if you follow the pattern you see there then you can even use Hosepipe later to requeue the messages if you need to.
Q2 (how to create an error queue per originating queue):
You can use the EasyNetQ Advanced Bus to do this cleanly. Use IBus.Advanced.Container.Resolve<IConventions> to get at the conventions interface. Then you can set the conventions for the error queue naming with conventions.ErrorExchangeNamingConvention and conventions.ErrorQueueNamingConvention. In my case I set the convention to be based on the name of the originating queue so that I get a queue/queue_error pair of queues every time I create a queue.
Q3 (how to process messages in the error queues):
You can declare a consumer for the error queue the same way you do any other queue. Again, the AdvancedBus lets you do this cleanly by specifying that the type coming off of the queue is EasyNetQ.SystemMessage.Error. So, IAdvancedBus.Consume<EasyNetQ.SystemMessage.Error>() will get you there. Retrying simply means republishing to the original exchange (paying attention to the retry count you put in the header (see my answer to Q1, above), and information in the Error message that you consumed off the error queue can help you find the target for republishing.

I know this is an old post but - just in case it helps someone else - here is my self-answered question (I needed to ask it because existing help was not enough) that explains how I implemented retrying failed messages on their original queues. The following should answer your question #1 and #3. For #2, you may have to use the Advanced API, which I haven't used (and I think it defeats the purpose of EasyNetQ; one might as well use RabbitMQ client directly). Also consider implementing IConsumerErrorStrategy, though.
1) Since there can be multiple consumers of a message and all may not need to retry a msg, I have a Dictionary<consumerId, RetryInfo> in the body of the message, as EasyNetQ does not (out of the box) support complex types in message headers.
public interface IMessageType
{
int MsgTypeId { get; }
Dictionary<string, TryInfo> MsgTryInfo {get; set;}
}
2) I have implemented a class RetryEnabledErrorMessageSerializer : IErrorMessageSerializer that just updates the TryCount and other information every time it is called by the framework. I attach this custom serializer to the framework on a per-consumer basis via the IoC support provided by EasyNetQ.
public class RetryEnabledErrorMessageSerializer<T> : IErrorMessageSerializer where T : class, IMessageType
{
public string Serialize(byte[] messageBody)
{
string stringifiedMsgBody = Encoding.UTF8.GetString(messageBody);
var objectifiedMsgBody = JObject.Parse(stringifiedMsgBody);
// Add/update RetryInformation into objectifiedMsgBody here
// I have a dictionary that saves <key:consumerId, val: TryInfoObj>
return JsonConvert.SerializeObject(objectifiedMsgBody);
}
}
And in my EasyNetQ wrapper class:
public void SetupMessageBroker(string givenSubscriptionId, bool enableRetry = false)
{
if (enableRetry)
{
_defaultBus = RabbitHutch.CreateBus(currentConnString,
serviceRegister => serviceRegister.Register<IErrorMessageSerializer>(serviceProvider => new RetryEnabledErrorMessageSerializer<IMessageType>(givenSubscriptionId))
);
}
else // EasyNetQ's DefaultErrorMessageSerializer will wrap error messages
{
_defaultBus = RabbitHutch.CreateBus(currentConnString);
}
}
public bool SubscribeAsync<T>(Func<T, Task> eventHandler, string subscriptionId)
{
IMsgHandler<T> currMsgHandler = new MsgHandler<T>(eventHandler, subscriptionId);
// Using the msgHandler allows to add a mediator between EasyNetQ and the actual callback function
// The mediator can transmit the retried msg or choose to ignore it
return _defaultBus.SubscribeAsync<T>(subscriptionId, currMsgHandler.InvokeMsgCallbackFunc).Queue != null;
}
3) Once the message is added to the default error queue, you can have a simple console app/windows service that periodically republishes existing error messages on their original queues. Something like:
var client = new ManagementClient(AppConfig.BaseAddress, AppConfig.RabbitUsername, AppConfig.RabbitPassword);
var vhost = client.GetVhostAsync("/").Result;
var aliveRes = client.IsAliveAsync(vhost).Result;
var errQueue = client.GetQueueAsync(Constants.EasyNetQErrorQueueName, vhost).Result;
var crit = new GetMessagesCriteria(long.MaxValue, Ackmodes.ack_requeue_false);
var errMsgs = client.GetMessagesFromQueueAsync(errQueue, crit).Result;
foreach (var errMsg in errMsgs)
{
var innerMsg = JsonConvert.DeserializeObject<Error>(errMsg.Payload);
var pubInfo = new PublishInfo(innerMsg.RoutingKey, innerMsg.Message);
pubInfo.Properties.Add("type", innerMsg.BasicProperties.Type);
pubInfo.Properties.Add("correlation_id", innerMsg.BasicProperties.CorrelationId);
pubInfo.Properties.Add("delivery_mode", innerMsg.BasicProperties.DeliveryMode);
var pubRes = client.PublishAsync(client.GetExchangeAsync(innerMsg.Exchange, vhost).Result, pubInfo).Result;
}
4) I have a MessageHandler class that contains a callback func. Whenever a message is delivered to the consumer, it goes to the MessageHandler, which decides if the message try is valid and calls the actual callback if so. If try is not valid (maxRetriesExceeded/the consumer does not need to retry anyway), I ignore the message. You can choose to Dead Letter the message in this case.
public interface IMsgHandler<T> where T: class, IMessageType
{
Task InvokeMsgCallbackFunc(T msg);
Func<T, Task> MsgCallbackFunc { get; set; }
bool IsTryValid(T msg, string refSubscriptionId); // Calls callback only
// if Retry is valid
}
Here is the mediator function in MsgHandler that invokes the callback:
public async Task InvokeMsgCallbackFunc(T msg)
{
if (IsTryValid(msg, CurrSubscriptionId))
{
await this.MsgCallbackFunc(msg);
}
else
{
// Do whatever you want
}
}

Here, I have implemented a Nuget package (EasyDeadLetter) for this purpose, which can be easily implemented with the minimum changes in any project.
All you need to do is follow the four steps :
First of all, Decorate your class object with QeueuAttribute
[Queue(“Product.Report”, ExchangeName = “Product.Report”)]
public class ProductReport { }
The second step is to define your dead-letter queue with the same QueueAttribute and also inherit the dead-letter object from the Main object class.
[Queue(“Product.Report.DeadLetter”, ExchangeName =
“Product.Report.DeadLetter”)]
public class ProductReportDeadLetter : ProductReport { }
Now, it’s time to decorate your main queue object with the EasyDeadLetter attribute and set the type of dead-letter queue.
[EasyDeadLetter(DeadLetterType =
typeof(ProductReportDeadLetter))]
[Queue(“Product.Report”, ExchangeName = “Product.Report”)]
public class ProductReport { }
In the final step, you need to register EasyDeadLetterStrategy as the default error handler (IConsumerErrorStrategy).
services.AddSingleton<IBus>
(RabbitHutch.CreateBus(“connectionString”,
serviceRegister =>
{
serviceRegister.Register<IConsumerErrorStrategy,
EasyDeadLetterStrategy>();
}));
That’s all. from now on any failed message will be moved to the related dead-letter queue.
See more detail here :
GitHub Repository
NuGet Package

Serial processing of a certain message type in Rebus

We have a Rebus message handler that talks to a third party webservice. Due to reasons beyond our immediate control, this WCF service frequently throws an exception because it encountered a database deadlock in its own database. Rebus will then try to process this message five times, which in most cases means that one of those five times will be lucky and not get a deadlock. But it frequently happens that a message does get deadlock after deadlock and ends up in our error queue.
Besides fixing the source of the deadlocks, which would be a longterm goal, I can think of two options:
Keep trying with only this particular message type until it succeeds. Preferably I would be able to set a timeout, so "if five deadlocks then try again in 5 minutes" rather than choke the process up even more by trying continuously. I already do a Thread.Sleep(random) to spread the messages somewhat, but it will still give up after five tries.
Send this particular message type to a different queue that has only one worker that processes the message, so that this happens serially rather than in parallel. Our current configuration uses 8 worker threads, but this just makes the deadlock situation worse as the webservice now gets called concurrently and the messages get in each other's way.
Option #2 has my preference, but I'm not sure if this is possible. Our configuration on the receiving side currently looks like this:
var adapter = new Rebus.Ninject.NinjectContainerAdapter(this.Kernel);
var bus = Rebus.Configuration.Configure.With(adapter)
.Logging(x => x.Log4Net())
.Transport(t => t.UseMsmqAndGetInputQueueNameFromAppConfig())
.MessageOwnership(d => d.FromRebusConfigurationSection())
.CreateBus().Start();
And the .config for the receiving side:
<rebus inputQueue="app.msg.input" errorQueue="app.msg.error" workers="8">
<endpoints>
</endpoints>
</rebus>
From what I can tell from the config, it's only possible to set one input queue to 'listen' to. I can't really find a way to do this via the fluent mapping API either. That seems to take only one input- and error queue as well:
.Transport(t =>t.UseMsmq("input", "error"))
Basically, what I'm looking for is something along the lines of:
<rebus workers="8">
<input name="app.msg.input" error="app.msg.error" />
<input name="another.input.queue" error="app.msg.error" />
</rebus>
Any tips on how to handle my requirements?

I suggest you make use of a saga and Rebus' timeout service to implement a retry strategy that fits your needs. This way, in your Rebus-enabled web service facade, you could do something like this:
public void Handle(TryMakeWebServiceCall message)
{
try
{
var result = client.MakeWebServiceCall(whatever);
bus.Reply(new ResponseWithTheResult{ ... });
}
catch(Exception e)
{
Data.FailedAttempts++;
if (Data.FailedAttempts < 10)
{
bus.Defer(TimeSpan.FromSeconds(1), message);
return;
}
// oh no! we failed 10 times... this is probably where we'd
// go and do something like this:
emailService.NotifyAdministrator("Something went wrong!");
}
}
where Data is the saga data that is made magically available to you and persisted between calls.
For inspiration on how to create a saga, check out the wiki page on coordinating stuff that happens over time where you can see an example on how a service might have some state (i.e. number of failed attempts in your case) stored locally that is made available between handling messages.
When the time comes to make bus.Defer work, you have two options: 1) use an external timeout service (which I usually have installed one of on each server), or 2) just use "yourself" as a timeout service.
At configuration time, you go
Configure.With(...)
.(...)
.Timeouts(t => // configure it here)
where you can either StoreInMemory, StoreInSqlServer, StoreInMongoDb, StoreInRavenDb, or UseExternalTimeoutManager.
If you choose (1), you need to check out the Rebus code and build Rebus.Timeout yourself - it's basically just a configurable, Topshelf-enabled console application that has a Rebus endpoint inside.
Please let me know if you need more help making this work - bus.Defer is where your system becomes awesome, and will be capable of overcoming all of the little glitches that make all others' go down :)

Bloomberg API request timing out

Having set up a ReferenceDataRequest I send it along to an EventQueue
Service refdata = _session.GetService("//blp/refdata");
Request request = refdata.CreateRequest("ReferenceDataRequest");
// append the appropriate symbol and field data to the request
EventQueue eventQueue = new EventQueue();
Guid guid = Guid.NewGuid();
CorrelationID id = new CorrelationID(guid);
_session.SendRequest(request, eventQueue, id);
long _eventWaitTimeout = 60000;
myEvent = eventQueue.NextEvent(_eventWaitTimeout);
Normally I can grab the message from the queue, but I'm hitting the situation now that if I'm making a number of requests in the same run of the app (normally around the tenth), I see a TIMEOUT EventType
if (myEvent.Type == Event.EventType.TIMEOUT)
throw new Exception("Timed Out - need to rethink this strategy");
else
msg = myEvent.GetMessages().First();
These are being made on the same thread, but I'm assuming that there's something somewhere along the line that I'm consuming and not releasing.
Anyone have any clues or advice?
There aren't many references on SO to BLP's API, but hopefully we can start to rectify that situation.

I just wanted to share something, thanks to the code you included in your initial post.
If you make a request for historical intraday data for a long duration (which results in many events generated by Bloomberg API), do not use the pattern specified in the API documentation, as it may end up making your application very slow to retrieve all events.
Basically, do not call NextEvent() on a Session object! Use a dedicated EventQueue instead.
Instead of doing this:
var cID = new CorrelationID(1);
session.SendRequest(request, cID);
do {
Event eventObj = session.NextEvent();
...
}
Do this:
var cID = new CorrelationID(1);
var eventQueue = new EventQueue();
session.SendRequest(request, eventQueue, cID);
do {
Event eventObj = eventQueue.NextEvent();
...
}
This can result in some performance improvement, though the API is known to not be particularly deterministic...

I didn't really ever get around to solving this question, but we did find a workaround.
Based on a small, apparently throwaway, comment in the Server API documentation, we opted to create a second session. One session is responsible for static requests, the other for real-time. e.g.
_marketDataSession.OpenService("//blp/mktdata");
_staticSession.OpenService("//blp/refdata");
The means one session operates in subscription mode, the other more synchronously - I think it was this duality which was at the root of our problems.
Since making that change, we've not had any problems.

My reading of the docs agrees that you need separate sessions for the "//blp/mktdata" and "//blp/refdata" services.

A client appeared to have a similar problem. I solved it by making hundreds of sessions rather than passing in hundreds of requests in one session. Bloomberg may not be to happy with this BFI (brute force and ignorance) approach as we are sending the field requests for each session but it works.

Nice to see another person on stackoverflow enjoying the pain of bloomberg API :-)
I'm ashamed to say I use the following pattern (I suspect copied from the example code). It seems to work reasonably robustly, but probably ignores some important messages. But I don't get your time-out problem. It's Java, but all the languages work basically the same.
cid = session.sendRequest(request, null);
while (true) {
Event event = session.nextEvent();
MessageIterator msgIter = event.messageIterator();
while (msgIter.hasNext()) {
Message msg = msgIter.next();
if (msg.correlationID() == cid) {
processMessage(msg, fieldStrings, result);
}
}
if (event.eventType() == Event.EventType.RESPONSE) {
break;
}
}
This may work because it consumes all messages off each event.

It sounds like you are making too many requests at once. BB will only process a certain number of requests per connection at any given time. Note that opening more and more connections will not help because there are limits per subscription as well. If you make a large number of time consuming requests simultaneously, some may timeout. Also, you should process the request completely(until you receive RESPONSE message), or cancel them. A partial request that is outstanding is wasting a slot. Since splitting into two sessions, seems to have helped you, it sounds like you are also making a lot of subscription requests at the same time. Are you using subscriptions as a way to take snapshots? That is subscribe to an instrument, get initial values, and de-subscribe. If so, you should try to find a different design. This is not the way the subscriptions are intended to be used. An outstanding subscription request also uses a request slot. That is why it is best to batch as many subscriptions as possible in a single subscription list instead of making many individual requests. Hope this helps with your use of the api.

By the way, I can't tell from your sample code, but while you are blocked on messages from the event queue, are you also reading from the main event queue while(in a seperate event queue)? You must process all the messages out of the queue, especially if you have outstanding subscriptions. Responses can queue up really fast. If you are not processing messages, the session may hit some queue limits which may be why you are getting timeouts. Also, if you don't read messages, you may be marked a slow consumer and not receive more data until you start consuming the pending messages. The api is async. Event queues are just a way to block on specific requests without having to process all messages from the main queue in a context where blocking is ok, and it would otherwise be be difficult to interrupt the logic flow to process parts asynchronously.

Develop Reference

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.