I'm using Redis as a distributed cache in an ASP.NET app.
It works until the Redis server becomes unavailable, and the question is:
How do I properly handle disconnection issues?
Redis is configured this way (Startup.cs):
services.AddDistributedRedisCache(...)
The AbortOnConnectFail option is set to false.
It is injected into a service via the constructor:
...
private readonly IDistributedCache _cache;
public MyService(IDistributedCache cache)
{
_cache = cache;
}
When Redis is down the following code throws an exception (StackExchange.Redis.RedisConnectionException: SocketFailure on 127.0.0.1:6379/Subscription ...):
var val = await _cache.GetAsync(key, cancellationToken);
I don't think that using reflection to inspect the connection state inside the _cache object is a good approach. So are there any 'right' options for handling this?
You could check out the Polly project. It has Retry/WaitAndRetry/RetryForever policies and circuit breakers that can be handy here: you can catch that RedisConnectionException and then retry, or fall back to another data source.
There is also a Polly plugin for the Microsoft DistributedCache provider.
Check it out.
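To illustrate the suggestion, here is a minimal sketch of the retry-then-fallback idea with Polly (a circuit breaker could be wrapped in the same way). The class name and the LoadFromDatabaseAsync helper are hypothetical, not from the original post:
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;
using Polly;
using StackExchange.Redis;

public class ResilientCacheReader
{
    private readonly IDistributedCache _cache;

    public ResilientCacheReader(IDistributedCache cache) => _cache = cache;

    public Task<byte[]> GetAsync(string key, CancellationToken token)
    {
        // Retry twice on connection failures, waiting 1 s and then 2 s.
        var retry = Policy<byte[]>
            .Handle<RedisConnectionException>()
            .WaitAndRetryAsync(2, attempt => TimeSpan.FromSeconds(attempt));

        // If Redis is still unreachable, fall back to the primary data store.
        var fallback = Policy<byte[]>
            .Handle<RedisConnectionException>()
            .FallbackAsync(ct => LoadFromDatabaseAsync(key, ct));

        // In a real service the policies would be built once and reused.
        return fallback.WrapAsync(retry)
            .ExecuteAsync(ct => _cache.GetAsync(key, ct), token);
    }

    // Hypothetical fallback; replace with your own database call.
    private Task<byte[]> LoadFromDatabaseAsync(string key, CancellationToken token) =>
        Task.FromResult<byte[]>(null);
}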
First of all, why is your Redis server becoming unavailable, and for how long? You should minimize these kinds of situations. Do you use Redis as a service from AWS, i.e. ElastiCache? If so, you can configure it to promote a read replica to become the new master if the original master fails.
To improve fault tolerance and reduce write downtime, enable Multi-AZ with Automatic Failover for your Redis (cluster mode disabled) cluster with replicas. For more information, see Minimizing downtime in ElastiCache for Redis with Multi-AZ.
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html
Apart from that, a fallback solution for an unresponsive Redis server is simply to retrieve the objects/entities that you are caching in Redis from the database when the Redis server is down. You can retry the Redis call a few times with a short delay between retries, and if the server is still down, just query the database. This results in a performance hit, but it is a better solution than throwing an error.
byte[] val = null;
var fetched = false;
var retryCount = 0;
do
{
    try
    {
        val = await _cache.GetAsync(key, cancellationToken);
        fetched = true;
    }
    catch (RedisConnectionException)
    {
        retryCount++;
        // Back off before the next attempt (2 s, then 4 s).
        await Task.Delay(retryCount * 2000, cancellationToken);
    }
}
while (retryCount < 3 && !fetched);

if (val == null)
{
    // Redis unavailable (or cache miss): fall back to your database call here,
    // e.g. val = await GetFromDatabaseAsync(key);
}
I have a Redis database on a CentOS server, and 3 Windows servers connect to it with approximately 1,000 reads/writes per second. Everything is on the same local LAN, so the ping time is less than one millisecond.
The problem is that at least 5 percent of read operations time out, even though I read at most 3 KB of data per operation with 'syncTimeout=15', which is much more than the network latency.
I installed Redis on Bash on my Windows 10 machine and simulated the problem. I also stopped the write operations. However, the problem still exists, with 0.5 percent timeouts, even though there is no network latency.
I also used a CentOS server in my LAN to simulate the problem; in this case I need 'syncTimeout' to be at least 100 milliseconds to keep the timeout rate below 1 percent.
I considered using some dictionaries to cache data from Redis, so there would be no need to make a request per item and I could take advantage of pipelining. I also came across StackRedis.L1, which is developed as an L1 cache for Redis, but I am not confident about how it keeps the L1 cache up to date.
This is my code to simulate the problem:
var connectionMulti = ConnectionMultiplexer.Connect(
"127.0.0.1:6379,127.0.0.1:6380,allowAdmin=true,syncTimeout=15");
// 100,000 keys
var testKeys = File.ReadAllLines("D:\\RedisTestKeys.txt");
for (var i = 0; i < 3; i++)
{
var safeI = i;
Task.Factory.StartNew(() =>
{
var serverName = $"server {safeI + 1}";
var stringDatabase = connectionMulti.GetDatabase(12);
PerformanceTest($"{serverName} -> String: ",
key => stringDatabase.StringGet(key), testKeys);
});
}
and the PerformanceTest method is:
private static void PerformanceTest(string testName, Func<string, RedisValue> valueExtractor,
IList<string> keys)
{
Task.Factory.StartNew(() =>
{
Console.WriteLine($"Starting {testName} ...");
var timeouts = 0;
var errors = 0;
long totalElapsedMilliseconds = 0;
var stopwatch = new Stopwatch();
foreach (var key in keys)
{
var redisValue = new RedisValue();
stopwatch.Restart();
try
{
redisValue = valueExtractor(key);
}
catch (Exception e)
{
if (e is TimeoutException)
timeouts++;
else
errors++;
}
finally
{
stopwatch.Stop();
totalElapsedMilliseconds += stopwatch.ElapsedMilliseconds;
lock (FileLocker)
{
File.AppendAllLines("D:\\TestResult.csv",
new[]
{
$"{stopwatch.ElapsedMilliseconds.ToString()},{redisValue.Length()},{key}"
});
}
}
}
Console.WriteLine(
$"{testName} {totalElapsedMilliseconds * 1.0 / keys.Count} (errors: {errors}), (timeouts: {timeouts})");
});
}
I expect all read operations to complete successfully in less than 15 milliseconds.
To achieve this, is adding an L1 cache in front of Redis a good solution? (It is very fast, on the scale of nanoseconds, but how do I handle synchronization?)
Or can Redis be improved by clustering or something else? (I tested it on Bash on my PC and did not get the expected result.)
Or can Redis be improved by clustering or something else?
Redis can be clustered, in different ways:
"regular" redis can be replicated to secondary read-only nodes, on the same machine or different machines; you can then send "read" traffic to some of the replicas
redis "cluster" exists, which allows you to split (shard) the keyspace over multiple primaries, sending appropriate requests to each node
redis "cluster" can also make use of readonly replicas of the sharded nodes
Whether that is appropriate or useful is contextual and needs local knowledge and testing.
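To illustrate the first option, here is a minimal sketch of sending read traffic to a replica with StackExchange.Redis; the endpoints are made up, and older library versions name the flag PreferSlave instead of PreferReplica:
using StackExchange.Redis;

// Connect to a primary and a replica (illustrative endpoints).
var mux = ConnectionMultiplexer.Connect("10.0.0.1:6379,10.0.0.2:6379");
var db = mux.GetDatabase();

// Writes always go to the primary; this read is allowed to hit a replica.
string value = db.StringGet("some-key", CommandFlags.PreferReplica);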
To achieve this, is adding an L1 cache in front of Redis a good solution?
Yes, it is a good solution. A request you don't make is much faster (and has much less impact on the rest of the system) than a request you do make. There are tools to help with cache invalidation, including using the pub/sub API for invalidations. Redis vNext is also looking into additional knowledge APIs specifically for this kind of L1 scenario.
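As a rough sketch of the pub/sub-based invalidation idea mentioned above (the channel name, the plain dictionary, and string-only values are simplifications for illustration, not a production design):
using System.Collections.Concurrent;
using StackExchange.Redis;

public class L1Cache
{
    private readonly ConcurrentDictionary<string, string> _local = new ConcurrentDictionary<string, string>();
    private readonly IDatabase _redis;
    private readonly ISubscriber _subscriber;

    public L1Cache(ConnectionMultiplexer mux)
    {
        _redis = mux.GetDatabase();
        _subscriber = mux.GetSubscriber();

        // Whenever any writer publishes a key name, drop the stale local copy.
        _subscriber.Subscribe("cache-invalidate", (channel, key) => _local.TryRemove(key, out _));
    }

    public string Get(string key)
    {
        // Serve from the in-process dictionary when possible; otherwise go to Redis.
        if (_local.TryGetValue(key, out var cached))
            return cached;

        string value = _redis.StringGet(key);
        if (value != null)
            _local[key] = value;
        return value;
    }

    public void Set(string key, string value)
    {
        _redis.StringSet(key, value);

        // Tell every process (including this one) to invalidate its local copy.
        _subscriber.Publish("cache-invalidate", key);
    }
}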
There is a .NET 4.7 Web API application working with SQL Server through Entity Framework and hosting an NServiceBus endpoint with the MSMQ transport.
A simplified workflow can be described by a controller action:
[HttpPost]
public async Task<IHttpActionResult> SendDebugCommand()
{
var sample = new Sample
{
State = SampleState.Initial,
};
_dataContext.Set<Sample>().Add(sample);
await _dataContext.SaveChangesAsync();
sample.State = SampleState.Queueing;
var options = new TransactionOptions
{
IsolationLevel = IsolationLevel.ReadCommitted,
};
using (var scope = new TransactionScope(TransactionScopeOption.Required, options, TransactionScopeAsyncFlowOption.Enabled))
{
await _dataContext.SaveChangesAsync();
await _messageSession.Send(new DebugCommand {SampleId = sample.Id});
scope.Complete();
}
_logger.OnCreated(sample);
return Ok();
}
And the handler for DebugCommand, which is sent to the same NServiceBus endpoint:
public async Task Handle(DebugCommand message, IMessageHandlerContext context)
{
var sample = await _dataContext.Set<Sample>().FindAsync(message.SampleId);
if (sample == null)
{
_logger.OnNotFound(message.SampleId);
return;
}
if (sample.State != SampleState.Queueing)
{
_logger.OnUnexpectedState(sample, SampleState.Queueing);
return;
}
// Some work being done
sample.State = SampleState.Processed;
await _dataContext.SaveChangesAsync();
_logger.OnHandled(sample);
}
Sometimes the message handler retrieves the Sample from the DB and its state is still Initial, not Queueing as expected. That means the distributed transaction initiated in the controller action has not yet fully completed. This is also confirmed by timestamps in the log file.
This 'sometimes' happens quite rarely, under heavier load, and network latency probably plays a role. I couldn't reproduce the problem with a local DB, but it reproduces easily with a remote DB.
I checked the DTC configuration. I verified that escalation to a distributed transaction definitely happens. Also, if scope.Complete() is not called, then neither the DB update nor the message sending happens.
When the transaction scope is completed and disposed, I intuitively expect both the DB and MSMQ to be settled before a single further instruction is executed.
I couldn't find definite answers to these questions:
Is this the way DTC works? Is it normal for both transaction parties to commit while completion has not yet been reported back to the coordinator?
If yes, does it mean I should handle such events by altering the logic of the program?
Am I misusing transactions somehow? What would be the right way?
In addition to the comments mentioned by Evk in Distributed transaction with MSMQ and SQL Server but sometimes getting dirty reads, here's also an excerpt from the relevant documentation page about transactions:
A distributed transaction between the queueing system and the persistent storage guarantees atomic commits but guarantees only eventual consistency.
Two additional notes:
NServiceBus uses IsolationLevel.ReadCommitted by default for the transaction used to consume messages. This can be configured, although I'm not sure whether setting it to Serializable on the consumer would really solve the issue here.
In general, it's not advisable to use a shared database between services, as this greatly increases coupling and opens the door to issues like the one you're experiencing here. Try to pass the relevant data as part of the message and keep the database an internal storage detail of one service (see the sketch below). Especially when using web servers, a common pattern is to add all the relevant data to a message and fire it off while confirming success to the user (as the message won't be lost), while the receiving endpoint stores the data in its own database if necessary. Giving more specific recommendations would require more knowledge about your domain and use case. I can recommend the Particular discussion community for design/architectural questions like this.
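As a small sketch of that suggestion, using the DebugCommand from the question (the CurrentState property and the int key type are illustrative assumptions, not part of the original code):
// Carry what the handler needs in the message itself instead of re-reading shared state.
public class DebugCommand : NServiceBus.ICommand
{
    public int SampleId { get; set; }
    public SampleState CurrentState { get; set; } // illustrative extra data
}

// The sender fills it in from the entity it just saved:
// await _messageSession.Send(new DebugCommand { SampleId = sample.Id, CurrentState = sample.State });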
Is there a way to programmatically check if Service Fabric is up and running from an external client? I thought about just using try catch blocks but I'm curious whether there's a more elegant way of doing it.
There is no single "is the cluster running?" command. There are different ways that question can be interpreted.
But for an external client, the simplest check is simply to try to communicate with the cluster. The cluster may be "running", but if your client can't communicate with it, then it can stop right there. To do that programmatically, you do have to catch a communication exception. Here is an example:
FabricClient client = new FabricClient();
try
{
await client.QueryManager.GetProvisionedFabricCodeVersionListAsync();
}
catch (FabricTransientException ex)
{
if (ex.ErrorCode == FabricErrorCode.CommunicationError)
{
// can't communicate with the cluster!
}
}
You're basically waiting for the connection to time out, so it will take a few seconds for this check to complete.
If your question is "How can I check if my service in Azure Service Fabric is up and running?" then you have some options. (If, on the other hand, your question is "How can I check if the Azure Service Fabric cluster is up and running?" then you should look at Vaclav's answer.)
Create a new instance of FabricClient; with it you can do a lot of fun stuff, including checking the health of your services.
If your cluster is secured (which is recommended), then you need to supply X509Credentials. You can follow this article for that: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-connect-to-secure-cluster.
var connection = "myclustername.westeurope.cloudapp.azure.com:19000";
var fabricClient = new FabricClient(GetCredentials(clientCertThumb, serverCertThumb, CommonName), connection);
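Here is a minimal sketch of the GetCredentials helper referenced above, along the lines of the linked article; the store location and store name are assumptions about your environment:
using System.Fabric;
using System.Security.Cryptography.X509Certificates;

private static X509Credentials GetCredentials(string clientCertThumb, string serverCertThumb, string commonName)
{
    var credentials = new X509Credentials
    {
        // Where the client certificate is installed on the calling machine.
        StoreLocation = StoreLocation.CurrentUser,
        StoreName = "My",
        FindType = X509FindType.FindByThumbprint,
        FindValue = clientCertThumb,
    };

    // The identities the client will accept from the cluster.
    credentials.RemoteCommonNames.Add(commonName);
    credentials.RemoteCertThumbprints.Add(serverCertThumb);
    return credentials;
}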
Using the associated manager clients you can check the health state of services deployed to your cluster.
Using FabricClient.HealthManager to check the health of individual partitions:
var serviceHealth = await fabricClient.HealthManager.GetServiceHealthAsync(serviceName);
foreach (var serviceHealthPartitionHealthState in serviceHealth.PartitionHealthStates)
{
Console.WriteLine($"Health state: {serviceHealthPartitionHealthState.PartitionId} {serviceHealthPartitionHealthState.AggregatedHealthState}");
}
Using FabricClient.QueryManager to check the aggregated health of each service:
var serviceList = await fabricClient.QueryManager.GetServiceListAsync(applicationName);
foreach (var service in serviceList)
{
Console.WriteLine($"\tFound service {service.ServiceName} {service.ServiceStatus} {service.HealthState}");
}
If any of these evaluate to anything other than System.Fabric.Health.HealthState.Ok, then your service might have some problems.
I have a cache instance running on Windows Azure. I'm connecting to it from my web application and getting intermittent exceptions with the following message:
ErrorCode:SubStatus:There is a temporary failure.
Please retry later. (One or more specified cache servers are
unavailable, which could be caused by busy network or servers. For
on-premises cache clusters, also verify the following conditions.
Ensure that security permission has been granted for this client
account, and check that the AppFabric Caching Service is allowed
through the firewall on all cache hosts. Also the MaxBufferSize on the
server must be greater than or equal to the serialized object size
sent from the client.). Additional Information : The client was trying
to communicate with the server:
net.tcp://myserver.cache.windows.net:22234.
I've been able to duplicate the problem with this snippet in LINQPad:
var config = new DataCacheFactoryConfiguration
{
AutoDiscoverProperty = new DataCacheAutoDiscoverProperty(true, "myserver.cache.windows.net"),
SecurityProperties = new DataCacheSecurity("key", false)
};
var factory = new DataCacheFactory(config);
var client = factory.GetDefaultCache();
//client.Put("foo", "bar");
for (int i = 0; i < 100; i++)
{
System.Threading.Tasks.Task.Factory.StartNew(o => {
var i1 = (int)o;
try
{
client.Get("foo").Dump();
} catch (Exception e)
{
e.Message.Dump();
}
}, i);
}
If I run this snippet as-is, spawning more than about 50 threads, I get the error. If I uncomment the initial Put(), I can run it with 10,000 threads. Either way, I make sure the entry is in the cache before I run this. I've tried using pessimistic locking and it does not seem to have any effect. I've used the latest client DLLs from NuGet. I've tried scaling the cache up to 1 GB with no other usage besides this snippet.
Since the requests in my web app come in on different threads, I believe this reasonably simulates what's happening in my app, and I'm definitely getting the same exception in both cases. Can anyone suggest a way to avoid this exception? Does it have to do with the initial Put() happening on the same thread as the constructor? That seems unlikely, but it's the only thing I can do in this test scenario to eliminate the exception.
I have a multithreaded application with thread-static sessions that does some work with files. It uses NHibernate to consume from services and runs on an Oracle DB; so far so good.
Every thread has a verbose log that uses a stateless session to be more lightweight. However, when some files are processed I can see that lots of cursors are open in Oracle for the log session.
For instance, for the log sessions:
324 SPC_LOG
310 SPC_LOG
121 SPC_LOG
and for the application itself:
31 SPC_PRODUCTION_LINE_TEST
27 SPC_PRODUCTION_LINE_TEST
21 SPC_PRODUCTION_LINE_TEST
This causes me to run out of Oracle cursors (ORA-01000).
Does somebody have an idea about what could cause this? Are cursors related to inserts or only to updates? I assumed that every thread, at the end of its life, closes all of its sessions, regular and stateless.
FYI, I'm writing the log this way:
In the session factory:
public IStatelessSession GetUserStatelessContext(ConnectionStringSettings connection)
{
lock (Padlock)
{
string key = GetConnectionKey(connection);
if (StatelessSessions == null)
{
StatelessSessions = new Dictionary<string, IStatelessSession>();
}
if (!StatelessSessions.ContainsKey(key))
{
StatelessSessions.Add(key, Factories[connection.ConnectionString].OpenStatelessSession());
}
return StatelessSessions[key];
}
}
And writing to the log:
using (ITransaction tx = this.LogProcessErrorRepository.BeginTransaction())
{
this.LogProcessErrorRepository.Add(log);
if (log.Informations != null)
{
foreach (AdditionalInformation info in log.Informations)
{
info.Text = this.OracleCLOBHack(info.Text);
this.AdditionalInformationRepository.Add(info);
}
}
tx.Commit();
}
For the record, the cause of the issue was the use of the Microsoft Oracle client (System.Data.OracleClient) instead of the Oracle Data Provider (Oracle.DataAccess). In Fluent NHibernate it is easy to confuse the two, as the former is OracleClientConfiguration and the ODP.NET one is OracleDataClientConfiguration, even though we were aware that the MS client is deprecated.
Database performance has now increased by 400% and there is no cursor leakage at all. So, from my point of view: never use the MS client.
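To make the fix concrete, here is a sketch of the Fluent NHibernate change we made (the connection-string key and the mapping type are placeholders, not our real names):
using FluentNHibernate.Cfg;
using FluentNHibernate.Cfg.Db;

var sessionFactory = Fluently.Configure()
    .Database(
        // Before: the deprecated Microsoft client (System.Data.OracleClient)
        // OracleClientConfiguration.Oracle10.ConnectionString(c => c.FromConnectionStringWithKey("Spc"))
        // After: ODP.NET (Oracle.DataAccess)
        OracleDataClientConfiguration.Oracle10
            .ConnectionString(c => c.FromConnectionStringWithKey("Spc")))
    .Mappings(m => m.FluentMappings.AddFromAssemblyOf<LogProcessError>())
    .BuildSessionFactory();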