Today we ran into a strange issue on a 3-node Always On availability group.
I had logged in to the server to carry out a change: applying a SQL patch to SQL Server 2016 SP2.
I discovered that the AG health did not look good and that the databases on one secondary replica were showing a RESTORING state.
Also, when I expanded the Availability Groups folder on the problematic secondary replica, there was nothing inside it, as shown in the snip below.
When I connected over RDP to the problematic secondary replica and, from there, connected to the SQL instances of all 3 replicas in SSMS, I could not see the PRIMARY AG, as shown in the snip below:
For a moment I was worried about what had happened to the primary and where the primary AG was hosted. I opened cluadmin and saw that the owner of the AG was the DR server. So I connected over RDP to the DR server and connected to the SQL instance, and the PRIMARY AG was there. Health was good for the primary replica and for one of the secondary replicas, but the other, problematic secondary replica was showing a DISCONNECTED state.
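The same check can be done without the dashboard by querying the AG catalog views on the primary. The sketch below is only an illustration: it keeps the T-SQL in a Python string and filters the result rows. How the query is executed (for example through a database driver) is left out, the helper name `disconnected_replicas` is made up for this example, and the sample rows and node names are hypothetical.

```python
# T-SQL intended for the primary replica; on a secondary this DMV
# returns a row only for the local replica.
REPLICA_STATE_QUERY = """
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc,
       ars.last_connect_error_number,
       ars.last_connect_error_description
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = ars.replica_id;
"""


def disconnected_replicas(rows):
    """Return names of replicas whose connected state is not CONNECTED.

    `rows` is assumed to be a list of dicts keyed by the column names
    selected in REPLICA_STATE_QUERY.
    """
    return [
        row["replica_server_name"]
        for row in rows
        if row["connected_state_desc"] != "CONNECTED"
    ]


# Hypothetical rows shaped like the query result, for illustration only.
sample_rows = [
    {"replica_server_name": "NODE1", "connected_state_desc": "CONNECTED"},
    {"replica_server_name": "NODE2", "connected_state_desc": "CONNECTED"},
    {"replica_server_name": "NODE3", "connected_state_desc": "DISCONNECTED"},
]
print(disconnected_replicas(sample_rows))
```

The `last_connect_error_number` and `last_connect_error_description` columns are worth reading alongside the connected state, since they may carry the same connection error that appears in the SQL error log.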
I went through the SQL error logs to see if I could find any errors related to the issue, and found the error below.
Other than that, I didn't see any errors that were useful for troubleshooting the issue.
So I googled the two errors below:
Message 35201: A connection timeout has occurred while attempting to establish a connection to availability replica ‘replicaname’ with id [availability_group_id]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.
This secondary replica is not connected to the primary replica. The connected state is DISCONNECTED.
I found the KB article below on support.microsoft:
But it was not much help for our environment, as our SQL instances are on SQL Server 2016 SP2.
We also saw that some cluster issues had been recorded under Cluster Events in Failover Cluster Manager. We informed the Windows team, and they are looking into them.
In the meantime, we were getting disk space alerts because data was not being moved to one of the secondary replicas.
So, as suggested in the Microsoft link above, we got permission from the client to reboot the problematic secondary replica and see whether that fixed the issue.
As the secondary replica was not in use anyway, the client approved and we rebooted the replica, but the issue persisted.
We then told the client that we would try removing the problematic secondary replica from the AG and adding it back. This would not impact production, but the databases would take time to synchronize. The client approved.
We removed the problematic secondary replica and added it back, and it connected without any issues.
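The post does not record the exact commands we ran, so the following is only a rough sketch of what a remove and re-add can look like in T-SQL, generated from a small Python helper so the names are easy to swap. Everything here is an assumption for illustration: the helper `readd_replica_script`, the AG, replica, endpoint, and database names, and the choice of asynchronous commit with manual failover.

```python
def bracket(name):
    """Quote an identifier T-SQL style, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def nstring(value):
    """Quote a Unicode string literal, doubling single quotes."""
    return "N'" + value.replace("'", "''") + "'"


def readd_replica_script(ag, replica, endpoint_url, databases,
                         availability_mode="ASYNCHRONOUS_COMMIT",
                         failover_mode="MANUAL"):
    """Build the statements for the primary and for the re-added secondary."""
    on_primary = [
        # Drop the broken replica from the AG definition.
        f"ALTER AVAILABILITY GROUP {bracket(ag)} REMOVE REPLICA ON {nstring(replica)};",
        # Add it back with its mirroring endpoint and commit/failover modes.
        f"ALTER AVAILABILITY GROUP {bracket(ag)} ADD REPLICA ON {nstring(replica)} "
        f"WITH (ENDPOINT_URL = {nstring(endpoint_url)}, "
        f"AVAILABILITY_MODE = {availability_mode}, "
        f"FAILOVER_MODE = {failover_mode});",
    ]
    # On the secondary: join the AG, then join each database to it.
    on_secondary = [f"ALTER AVAILABILITY GROUP {bracket(ag)} JOIN;"]
    on_secondary += [
        f"ALTER DATABASE {bracket(db)} SET HADR AVAILABILITY GROUP = {bracket(ag)};"
        for db in databases
    ]
    return on_primary, on_secondary


# Hypothetical names, for illustration only.
primary_sql, secondary_sql = readd_replica_script(
    "AG1", "NODE3", "TCP://node3.example.local:5022", ["SalesDb"]
)
print("\n".join(primary_sql))
print("\n".join(secondary_sql))
```

The `SET HADR` step assumes the database copies on the secondary are still in the RESTORING state and can catch up from the primary's log. If they cannot, they would first have to be restored WITH NORECOVERY from fresh backups.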
The databases started synchronizing, and it took some time for them to reach the SYNCHRONIZED state.
The dashboard now shows all 3 replicas as healthy and the databases in sync.
The Windows team is still investigating the cluster issues; as of now we don't have an update.
Also, according to the Microsoft link above, this type of issue occurs on very busy servers.