I am using Akka.NET in a very simple client-server configuration, nothing advanced at this point. After about 3 or 4 days of sending messages back and forth, the entire system seems to end up in a disconnected state. Restarting the services makes everything reconnect without issues. Before that point, connections do drop occasionally, but they seem to reconnect right away.
During this time both machines are accessible on the network and don't seem to have any actual connection problems.
I am not sure where to go from here.
Client config (Server very similar)
return ConfigurationFactory
.ParseString(string.Format(@"
akka {{
loggers = [""XYZ.AkkaLogger, XYZ""]
actor {{
provider = ""Akka.Remote.RemoteActorRefProvider, Akka.Remote""
serializers {{
json = ""XYZ.AkkaSerializer, XYZ""
}}
}}
remote {{
helios.tcp {{
transport-class = ""Akka.Remote.Transport.Helios.HeliosTcpTransport, Akka.Remote""
applied-adapters = []
transport-protocol = tcp
port = 0
hostname = {0}
send-buffer-size = 512000b
receive-buffer-size = 512000b
maximum-frame-size = 1024000b
tcp-keepalive = on
}}
transport-failure-detector {{
heartbeat-interval = 60 s # default 4s
acceptable-heartbeat-pause = 20 s # default 10s
}}
}}
stdout-loglevel = DEBUG
loglevel = DEBUG
debug {{
receive = on
autoreceive = on
lifecycle = on
event-stream = on
unhandled = on
}}
}}
", Environment.MachineName));
The following cycle is sporadic at first; after a while it repeats continuously and nothing connects anymore until the service is restarted:
WARN 2015-07-31 07:22:12,994 [1584] - Association with remote system akka.tcp://SystemName@Server:8081 has failed; address is now gated for 5000 ms. Reason is: [Disassociated]
ERROR 2015-07-31 07:22:12,994 [1584] - Disassociated
Akka.Remote.EndpointDisassociatedException: Disassociated
at Akka.Remote.EndpointWriter.PublishAndThrow(Exception reason, LogLevel level)
at Akka.Remote.EndpointWriter.Unhandled(Object message)
at Akka.Remote.EndpointWriter.Writing(Object message)
at Akka.Actor.ActorCell.<>c__DisplayClass3e.<Akka.Actor.IUntypedActorContext.Become>b__3d(Object m)
at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
at Akka.Actor.ActorCell.ReceiveMessage(Object message)
at Akka.Actor.ActorCell.AutoReceiveMessage(Envelope envelope)
at Akka.Actor.ActorCell.Invoke(Envelope envelope)
DEBUG 2015-07-31 07:22:12,996 [1494] - Disassociated [akka.tcp://SystemName@Client:57284] -> akka.tcp://SystemName@Server:8081
DEBUG 2015-07-31 07:23:13,033 [1469] - Drained buffer with maxWriteCount: 50, fullBackoffCount: 1,smallBackoffCount: 0, noBackoffCount: 0,adaptiveBackoff: 10000
ERROR 2015-07-31 07:24:13,019 [1601] - No response from remote. Handshake timed out or transport failure detector triggered.
DEBUG 2015-07-31 07:24:13,020 [1569] - Disassociated [akka.tcp://SystemName@Client:57284] -> akka.tcp://SystemName@Server:8081
WARN 2015-07-31 07:24:13,020 [1601] - Association with remote system akka.tcp://SystemName@Server:8081 has failed; address is now gated for 5000 ms. Reason is: [Disassociated]
ERROR 2015-07-31 07:24:13,021 [1601] - Disassociated
When sending data to Kafka using an async, idempotent producer, we received the error shown below. Restarting the producer applications resolved the issue. We opened a ticket with Confluent - Ticket Link. They said it is an existing bug in the Kafka client we use, and suggested that restarting the producers or resetting the transactional producer would address the issue until they release a client update (which is not in their release pipeline yet).
We tried to reproduce the issue in our dev and QA environments under higher load but could not. Does anyone have an idea how we could reproduce it? Also, are there any other suggestions for handling this issue in production?
The error we received in our producer application, most likely caused by the confluent-kafka-dotnet client we use:
"%3|1668238507.285|ERROR|<PUBLISHER_APP_NAME>#producer-1|: Fatal error: Broker: Broker received an out of order sequence number: ProduceRequest for <TOPIC_NAME> [38] with 1 message(s) failed due to sequence desynchronization with broker 1 (PID{Id:22181,Epoch:0}, base seq 0, idemp state change 8184098ms ago, last partition error NO_ERROR (actions , base seq 0..0, base msgid 0, -1ms ago)"
Our Producer configuration looks like:
config = new ProducerConfig {
BootstrapServers = appConfiguration.KafkaBootStrapServers,
SecurityProtocol = SecurityProtocol.Ssl,
Acks = Acks.All,
EnableIdempotence = true,
ClientId = appConfiguration.KafkaClientID
};
Our ASP.NET C# producer method call looks like:
try {
Headers headers = new Headers();
headers.Add(<CONSTANT>.ATT_TRANSACTION_ID, Encoding.ASCII.GetBytes(transactionId));
DeliveryResult<string, string> response =
await kafkaProducer.ProduceAsync(topic, new Message<string, string> { Key = key, Value = eventData, Headers = headers });
return response; // logging
}
catch (ProduceException<string, string> ex)
{
// catch exception, processing and logging
return null;
}
catch (Exception ex)
{
// catch exception and logging
return null;
}
Environment -
Kafka cluster (v2.8.0) with 3 brokers and in-sync replicas
Confluent.Kafka nuget library 1.7.0 with librdkafka.redist v1.7.0
I use IBM XMS to connect to a third party to send and receive messages.
UPDATE:
Client .Net Core 3.1
IBM XMS library version from Nuget. Tried 9.2.4 and 9.1.5 with same results
Same code used to work fine a week ago - so something must have changed in the MQ manager or somewhere in my infrastructure
SSL and client certificates
I have been using a receive with a timeout for a while without problems, but since last week I stopped seeing any messages to pick up, even when they were there. Once I changed to the receive method without a timeout, I started picking up messages again, but only every 5 minutes.
Looking at the XMS logs I can see the messages are actually read almost immediately, with and without a timeout, but XMS seems to decide to wait those 5 minutes before returning the message...
I haven't changed anything on my side, and the third party assures me they haven't either.
My question is: given the below code used to receive is there anything there that may be the cause of the 5 minutes wait? Any ideas on things I can try? I can share the XMS logs too if that helps.
// This is used to set the default properties in the factory before calling the receive method
private void SetConnectionProperties(IConnectionFactory cf)
{
cf.SetStringProperty(XMSC.WMQ_HOST_NAME, _mqConfiguration.Host);
cf.SetIntProperty(XMSC.WMQ_PORT, _mqConfiguration.Port);
cf.SetStringProperty(XMSC.WMQ_CHANNEL, _mqConfiguration.Channel);
cf.SetStringProperty(XMSC.WMQ_QUEUE_MANAGER, _mqConfiguration.QueueManager);
cf.SetStringProperty(XMSC.WMQ_SSL_CLIENT_CERT_LABEL, _mqConfiguration.CertificateLabel);
cf.SetStringProperty(XMSC.WMQ_SSL_KEY_REPOSITORY, _mqConfiguration.KeyRepository);
cf.SetStringProperty(XMSC.WMQ_SSL_CIPHER_SPEC, _mqConfiguration.CipherSuite);
cf.SetIntProperty(XMSC.WMQ_CONNECTION_MODE, XMSC.WMQ_CM_CLIENT);
cf.SetIntProperty(XMSC.WMQ_CLIENT_RECONNECT_OPTIONS, XMSC.WMQ_CLIENT_RECONNECT);
cf.SetIntProperty(XMSC.WMQ_CLIENT_RECONNECT_TIMEOUT, XMSC.WMQ_CLIENT_RECONNECT_TIMEOUT_DEFAULT);
}
public IEnumerable<IMessage> ReceiveMessage()
{
using var connection = _connectionFactory.CreateConnection();
using var session = connection.CreateSession(false, AcknowledgeMode.AutoAcknowledge);
using var destination = session.CreateQueue(_mqConfiguration.ReceiveQueue);
using var consumer = session.CreateConsumer(destination);
connection.Start();
var result = new List<IMessage>();
var keepRunning = true;
while (keepRunning)
{
try
{
var sw = new Stopwatch();
sw.Start();
var message = _mqConfiguration.ConsumerTimeoutMs == 0 ? consumer.Receive()
: consumer.Receive(_mqConfiguration.ConsumerTimeoutMs);
if (message != null)
{
result.Add(message);
_messageLogger.LogInMessage(message);
var elapsedMillis = sw.ElapsedMilliseconds;
if (_mqConfiguration.ConsumerTimeoutMs == 0)
{
keepRunning = false;
}
}
else
{
keepRunning = false;
}
}
catch (Exception e)
{
// We log the exception
keepRunning = false;
}
}
consumer.Close();
destination.Dispose();
session.Dispose();
connection.Close();
return result;
}
The symptoms look like a match for APAR IJ20591: Managed .NET SSL application making MQGET calls unexpectedly receives MQRC_CONNECTION_BROKEN when running in .NET Core. This affects messages larger than 15 KB when using the IBM MQ .NET Standard (Core) libraries over TLS channels. See also this thread. This will be fixed in 9.2.0.5; no CDS release is listed.
It states:
Setting the heartbeat interval to lower values may reduce the frequency of occurrence.
If your .NET application is not using a CCDT you can lower the heartbeat by having the SVRCONN channel's HBINT lowered and reconnecting your application.
I'm trying to load test a basic API, and I've started getting some strange issues coming out of the database connection.
I've now narrowed it down to the SQL connection itself. (I'm using SELECT 1 to test connection only)
Under very low load (15 calls per second) everything works exactly as expected.
Under low load (25 calls per second) the first 4-5 calls come back at an okay speed, then things slow down rapidly. A lot of calls time out because no connection is available in the pool.
Under medium load (50 calls per second) everything locks up entirely and nothing comes back. I also start to get strange errors like "A network-related or instance-specific error occurred while establishing a connection to SQL Server." Again, no connection can be obtained from the pool.
exec sp_who2 on the server shows no connections from dotnet either.
To make it worse the only way to recover from this is to bounce the entire service.
I have ruled out the server itself because this happens against a powerful on-prem SQL Server, an Azure SQL database, and a local instance running in Docker.
int selected = 0;
var timer = Stopwatch.StartNew();
using (SqlConnection connection = CreateNewConnection())
{
try
{
connection.Open();
selected = connection.QueryFirst<int>("SELECT 1");
timer.Stop();
}
catch (Exception e)
{
Console.WriteLine("Failed connection");
Console.WriteLine("fatal " + e.Message);
responseBuilder.AddErrors(e);
}
finally
{
connection.Close();
}
}
responseBuilder.WithResult(new {selected, ms = timer.ElapsedMilliseconds});
I've even tried disposing, and forcing the connection close manually to understand what is going on.
This is running on .NET Core with Dapper (I get the same issues even without Dapper).
I've also tried upping the max connection pool limit to absurd numbers like 1000, and there was no effect.
edit
After trying a bit more, I decided to try with Postgres, which works perfectly at over 1k calls per second.
Am I missing something in SQL Server itself, or on the connection?
Something to point out: these are shotgun calls. A batch gets fired off as fast as possible, and then I wait for each request to return.
Also this is using linux (and environments are docker k8s)
Someone wanted to know how connections got created
private SqlConnection CreateNewConnection()
{
var builder = new SqlConnectionStringBuilder()
{
UserID = "sa",
Password = "012Password!",
InitialCatalog = "test",
DataSource = "localhost",
MultipleActiveResultSets = true,
MaxPoolSize = 1000
};
return new SqlConnection(builder.ConnectionString);
}
Another note
Not shotgunning (waiting for the previous call to complete, before sending another) seems to have a decent enough throughput. It appears to be something with handling too many requests at the same time
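To make the two load patterns described above concrete, here is a minimal sketch of "shotgun" versus bounded dispatch. It is only an illustration: `LoadPatterns`, `FakeCallAsync` and `maxInFlight` are hypothetical names, and `Task.Delay` stands in for the real API call, so it says nothing about SqlClient itself.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class LoadPatterns
{
    // Placeholder for one API call (assumption: the real call is awaitable).
    public static async Task<int> FakeCallAsync(int i)
    {
        await Task.Delay(10);
        return i;
    }

    // "Shotgun": start every call at once, then wait for all of them.
    public static Task<int[]> ShotgunAsync(int count)
    {
        return Task.WhenAll(Enumerable.Range(0, count).Select(i => FakeCallAsync(i)));
    }

    // Bounded: at most maxInFlight calls are running at any moment.
    public static async Task<int[]> BoundedAsync(int count, int maxInFlight)
    {
        using (var gate = new SemaphoreSlim(maxInFlight))
        {
            var tasks = Enumerable.Range(0, count).Select(async i =>
            {
                await gate.WaitAsync();
                try { return await FakeCallAsync(i); }
                finally { gate.Release(); }
            }).ToArray();

            // Results come back in the order the tasks were created.
            return await Task.WhenAll(tasks);
        }
    }
}
```

Calling `BoundedAsync(1000, 1)` corresponds to the "wait for the previous call" case, and raising `maxInFlight` step by step might help find the concurrency level at which the problem starts.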
Version Information
dotnet 2.1.401
SqlClient 4.5.1
I can verify something fishy is going on, but it's probably not pooling. I created a console application and ran it from both a Windows console and a WSL console on the same box. This way I was able to run the same code from the same client but on a different OS/runtime.
On Windows, each connection took less than a millisecond even with an absurd DOP of 500:
985 : 00:00:00.0002307
969 : 00:00:00.0002107
987 : 00:00:00.0002270
989 : 00:00:00.0002392
The same code inside WSL would take 8 seconds or more, even with a DOP of 20! Larger DOP values resulted in timeouts. 10 would produce results similar to Windows.
Once I disabled MARS though performance went back to normal :
983 : 00:00:00.0083687
985 : 00:00:00.0083759
987 : 00:00:00.0083971
989 : 00:00:00.0083938
992 : 00:00:00.0084922
991 : 00:00:00.0045206
994 : 00:00:00.0044566
That's still 20 times slower than running on Windows directly, but hardly noticeable until you check the numbers side by side.
This is the code I used in both cases :
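using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.Threading.Tasks;

class Program
{
static void Main(string[] args)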
{
Console.WriteLine("Starting");
var options=new ParallelOptions { MaxDegreeOfParallelism = 500 };
var watch=Stopwatch.StartNew();
Parallel.For(0,1000,options,Call);
Console.WriteLine($"Finished in {watch.Elapsed}");
}
public static void Call(int i)
{
var watch = Stopwatch.StartNew();
using (SqlConnection connection = CreateNewConnection())
{
try
{
connection.Open();
var cmd=new SqlCommand($"SELECT {i}",connection);
var selected =cmd.ExecuteScalar();
Console.WriteLine($"{selected} : {watch.Elapsed}");
}
catch (Exception e)
{
Console.WriteLine($"Ooops!: {e}");
}
}
}
private static SqlConnection CreateNewConnection()
{
var builder = new SqlConnectionStringBuilder()
{
UserID = "someUser",
Password = "somPassword",
InitialCatalog = "tempdb",
DataSource = @"localhost",
MultipleActiveResultSets = true,
Pooling=true //true by default
//MaxPoolSize is 100 by default
};
return new SqlConnection(builder.ConnectionString);
}
}
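For reference, the MARS-off run mentioned above only needs a different connection string. A minimal sketch, assuming the same placeholder credentials as the test program (`ConnectionFactory` is a hypothetical class name used only here):

```csharp
using System.Data.SqlClient;

public static class ConnectionFactory
{
    // Same placeholder settings as the test program, with MARS switched off.
    public static SqlConnection CreateNewConnection()
    {
        var builder = new SqlConnectionStringBuilder
        {
            UserID = "someUser",
            Password = "somPassword",
            InitialCatalog = "tempdb",
            DataSource = "localhost",
            MultipleActiveResultSets = false, // MARS off (this is also the default)
            Pooling = true                    // true by default
        };
        return new SqlConnection(builder.ConnectionString);
    }
}
```

Since `MultipleActiveResultSets` defaults to false, simply removing the property from the builder should have the same effect.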
I have a Web API service that sends SignalR notifications to a proxy running on an MVC website that sits on the same box. After lots of tweaks I have got the web service talking to the hub on the website, as confirmed by debug logging.
On the website side, it seems like everything is being called fine and it seems to have a reference to my client browser that is calling the website. Below I'll paste the Hub class and the debug log output to prove this.
The final line is
Clients.Client(conn).addNotification(notification);
which I assume then looks for a JavaScript method matching that signature, and I have one. Below is the method in my JS files.
notifyProxy.client.addNotification = function (notification) {
$notificationTableBody.prepend(rowTemplate.supplant(formatNotification(notification)));
};
However setting a break point on this (Using chrome developer tools), the method never gets called. No errors are reported from the MVC website and none in the developer tools console, though I don't know if SignalR writes errors elsewhere.
The only other thing of note is in the JS file I have a line
$.connection.hub.start().done(init);
Which gets called when I set a breakpoint in Chrome developer tools. However the init() method never gets called. Does this mean potentially the hub.start() method is failing? If so how can I find out why?
What is going wrong? I am quite new to SignalR so am losing my mind!
Code for my hub class and debug output below
public void PushNotification(EventNotification notification, List<string> users)
{
_logger.WriteDebug("Entered push notification");
if (notification.FriendlyText.Length > 1000)
notification.FriendlyText = notification.FriendlyText.Substring(0, 1000) + "...";
if (users.Contains("*") && users.Count > 1)
{
users.RemoveAll(u => !u.Equals("*"));
}
foreach (var userName in users)
{
_logger.WriteDebug("Push notification for " + userName);
UserConnectionInformation user = null;
if (UserInformation.TryGetValue(userName.ToUpperInvariant(), out user))
{
_logger.WriteDebug("Adding " + notification.FriendlyText + ", " + user.ConnectionIds.Count + " connections");
user.Notifications.Add(notification);
lock (user.ConnectionIds)
{
foreach (var conn in user.ConnectionIds)
{
_logger.WriteDebug("Is Client null - {0}", Clients.Client(conn) == null);
_logger.WriteDebug("Conn {0}", conn);
var proxy = Clients.Client(conn) as ConnectionIdProxy;
_logger.WriteDebug("proxy {0}", proxy.ToString());
Clients.Client(conn).addNotification(notification);
_logger.WriteDebug("Added notification ");
}
}
}
}
}
11-12-2015 12:19:43,808 [UK\kerslaj1][20] DEBUG Centrica.CE.SEFlex.Common.Logging.ConsoleLogger - Entered push notification
11-12-2015 12:19:43,808 [UK\kerslaj1][20] DEBUG Centrica.CE.SEFlex.Common.Logging.ConsoleLogger - Push notification for kerslaj1
11-12-2015 12:19:43,808 [UK\kerslaj1][20] DEBUG Centrica.CE.SEFlex.Common.Logging.ConsoleLogger - Push notification for UK\kerslaj1
11-12-2015 12:19:43,808 [UK\kerslaj1][20] DEBUG Centrica.CE.SEFlex.Common.Logging.ConsoleLogger - Adding Job 'testrose45job-11dec15121944' ('ROSE-0000045') cancellation has been requested by 'UK\kerslaj1' on 'seflexpool3'., 1 connections
11-12-2015 12:19:43,808 [UK\kerslaj1][20] DEBUG Centrica.CE.SEFlex.Common.Logging.ConsoleLogger - Is Client null - False
11-12-2015 12:19:43,808 [UK\kerslaj1][20] DEBUG Centrica.CE.SEFlex.Common.Logging.ConsoleLogger - Conn 3d008947-0d96-4965-bb01-dfd1517c24a5
11-12-2015 12:19:43,808 [UK\kerslaj1][20] DEBUG Centrica.CE.SEFlex.Common.Logging.ConsoleLogger - proxy Microsoft.AspNet.SignalR.Hubs.ConnectionIdProxy
11-12-2015 12:19:44,803 [UK\kerslaj1][20] DEBUG Centrica.CE.SEFlex.Common.Logging.ConsoleLogger - Added notification
See if you can get errors back from the proxy:
$.connection.hub.error(function (error) {
console.log('SignalR error: ' + error)
});
Also you might need to enable client side logging before starting the connection:
// enable client side logging
$.connection.hub.logging = true;
$.connection.hub.start()
.done(function(){ console.log('Now connected, connection ID=' + $.connection.hub.id); })
.fail(function(){ console.log('Could not Connect!'); });
Let's see if we get any further from there and if the start of the hub connection is the actual problem.
library: clrzmq4 (https://github.com/zeromq/clrzmq4) in a c# project.
I am using a ZeroMQ router-dealer configuration. The server is written in Python and runs on Linux. My dealer client, written in C#, runs on a Windows machine. It sends messages and waits for the response.
public Boolean sendMessage(Dictionary<String, String> msgDict)
{
ZError err;
String errStr;
var reqFrame = new ZFrame(JsonConvert.SerializeObject(msgDict));
retval = socket.Send(reqFrame, out err);
if (err != null)
{
errStr = String.Format("Error while sending command {0} {1} {2}", err.Text, err.Number, err.Name);
return false;
}
err = null;
respFrame = socket.ReceiveFrame(out err);
if (err != null)
{
errStr = String.Format("Error while receiving response data {0} {1} {2} {3}", err.Text, err.Number, err.Name, num_messages);
return false;
}
return true;
}
I set the sendTimeout and receiveTimeout on the socket to 2 min each.
When I keep calling sendMessage, ReceiveFrame times out at exactly the 255th call. On the server I see the message being processed and the response being sent, just like every other time. After this point, my send also times out with the same error: "EAGAIN" Resource temporarily unavailable.
These are the things I tried:
Data with different lengths from 2 KB to 20 MB
Set the sendhighwatermark and receivehighwatermark to different values: 10, 1000, 10000
Tried polling on the socket instead of ReceiveFrame
Tried making the sockets completely blocking.
In each of the above cases the failure occurred at exactly the 255th call. With blocking sockets, it blocked on the 255th call too.
Much as I would like to, I can't use NetMQ because it doesn't have CurveZMQ and the server needs it.
I also tried a dealer client from another Linux machine, and it had no issues at the 255th call or later.