We had a strange issue with a three-node Always On availability group today.
I logged on to the server to implement a change: applying a patch to SQL Server 2016 SP2.
I discovered that the AG’s health was not looking good, though: one secondary replica was DISCONNECTED, and the databases on that replica were in the RESTORING state.
If a secondary replica in an Always On availability group is in a DISCONNECTED state, it means that it is no longer able to communicate with the primary replica or other secondary replicas in the availability group. This can be caused by a number of issues, such as network connectivity problems, configuration issues, or resource contention on the secondary replica.
Here are a few steps you can take to troubleshoot the issue:
- Check the network connectivity: Make sure that the secondary replica has a stable and reliable network connection to the primary replica and other secondary replicas. Check for any network configuration issues or problems with the network equipment.
- Check the availability group configuration: Make sure that the availability group configuration is correct and that the secondary replica is correctly configured to participate in the availability group.
- Check the resource utilization: Make sure that the secondary replica has sufficient resources, such as CPU and memory, to keep up with the replication workload.
- Check the error log: Check the SQL Server error log on the secondary replica to see if there are any error messages related to the disconnected state.
- Check the Windows Application Event log: Check the Windows Event log for any events that may indicate the reason for the disconnected state.
- Check the status of the synchronization: Check the status of the data synchronization between the primary and secondary replica. You can use the sys.dm_hadr_database_replica_states dynamic management view to check the synchronization status.
- Check for any known bugs and troubleshoot accordingly: Check if there are any known bugs in the version of SQL Server you are running that could cause this issue.
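- Try to reconnect the replica: If all the above steps fail, you can try to re-establish the connection manually, for example by stopping and starting the database mirroring endpoint on the affected replica with ALTER ENDPOINT (STATE = STOPPED, then STATE = STARTED), or by resuming data movement for a database with ALTER DATABASE ... SET HADR RESUME. Both can also be done through the SSMS GUI.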
It’s important to note that some of these steps may require downtime, so have a plan in place before attempting them.
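The synchronization check from the list above can be sketched in Python. This is a minimal sketch, not a definitive tool: the T-SQL uses the real sys.dm_hadr_database_replica_states and sys.availability_replicas views, but the column selection, the find_unhealthy helper, and the sample rows are assumptions made up for illustration. How the rows are fetched from the server (for example, with a database driver) is deliberately left out.

```python
# T-SQL to run on the primary replica. The view names are real; the column
# selection is just one reasonable choice for a quick health check.
REPLICA_STATE_QUERY = """
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id;
"""


def find_unhealthy(rows):
    """Return the rows whose synchronization health is not HEALTHY.

    `rows` is a list of dicts keyed by the column names in the query above.
    """
    return [r for r in rows if r["synchronization_health_desc"] != "HEALTHY"]


# Hypothetical sample rows (invented server and database names).
sample = [
    {"replica_server_name": "NODE1", "database_name": "SalesDb",
     "synchronization_state_desc": "SYNCHRONIZED",
     "synchronization_health_desc": "HEALTHY"},
    {"replica_server_name": "NODE3", "database_name": "SalesDb",
     "synchronization_state_desc": "NOT SYNCHRONIZING",
     "synchronization_health_desc": "NOT_HEALTHY"},
]

# Print only the replica/database pairs that need attention.
for row in find_unhealthy(sample):
    print(row["replica_server_name"], row["database_name"],
          row["synchronization_state_desc"])
```

Filtering on the health column rather than the state column keeps the check simple: an asynchronous replica that is SYNCHRONIZING is still reported as healthy, so it would not raise a false alarm.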
So I searched for the error below:
Error Message 35201:
When attempting to connect to availability replica ‘My replica’ with id [availability group id], a connection timeout occurred. Either a network or firewall problem, or the replica’s endpoint address is not the DB mirroring endpoint of the host server instance.
In other words, there is no connection between this secondary replica and the primary replica, and the replica’s connected state is DISCONNECTED.
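Since error 35201 points at the network, the firewall, or the endpoint address, a quick first test is whether the replica’s endpoint port accepts TCP connections at all. Below is a minimal sketch using only the Python standard library. The endpoint_reachable helper is hypothetical, and port 5022 is only the commonly used default for the database mirroring endpoint, so treat it as an assumption and confirm the real port in sys.tcp_endpoints.

```python
import socket


def endpoint_reachable(host, port=5022, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 5022 is an assumed default; pass the port your endpoint really uses.
    """
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run it from the primary against the secondary’s host name, and then in the other direction, because the endpoints have to reach each other both ways. A True result only shows that the port is open; it does not prove that the endpoint is started or that the service accounts have CONNECT permission on it.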
I found the following KB article on support.microsoft.com.
However, it did not apply to our instance version, SQL Server 2016 SP2, so it was of no use for my current organization.
As a result, we alerted the Windows team, who are looking into it. We also found that some cluster issues had been recorded under Cluster Events in Failover Cluster Manager.
Because data is not being moved to one of the secondary replicas, we are receiving disk space alerts in the meantime.
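One plausible explanation for disk space alerts in this situation is that the primary cannot truncate its transaction log while a secondary replica is not receiving it, so the log file keeps growing. This has not been confirmed for this incident. A minimal sketch for checking it follows: sys.databases and its log_reuse_wait_desc column are real, while the logs_held_by_replica helper and the sample rows are invented for illustration.

```python
# T-SQL to run on the primary replica; it shows why each log cannot be reused.
LOG_REUSE_QUERY = """
SELECT name, log_reuse_wait_desc
FROM sys.databases;
"""


def logs_held_by_replica(rows):
    """Return the names of databases whose log truncation is waiting on an
    availability replica (log_reuse_wait_desc = 'AVAILABILITY_REPLICA')."""
    return [r["name"] for r in rows
            if r["log_reuse_wait_desc"] == "AVAILABILITY_REPLICA"]


# Hypothetical sample rows (invented database names).
sample_rows = [
    {"name": "SalesDb", "log_reuse_wait_desc": "AVAILABILITY_REPLICA"},
    {"name": "master", "log_reuse_wait_desc": "NOTHING"},
]

print(logs_held_by_replica(sample_rows))
```

If the affected databases show AVAILABILITY_REPLICA here, log backups alone will not free the space until the secondary reconnects and catches up, or until it is removed from the availability group.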
To test whether it would fix the issue, we asked the client for permission to reboot the problematic secondary replica, as suggested on the Microsoft page above.
The customer gave the okay, and since that secondary replica isn’t being used for anything, we rebooted it, but the issue still exists.