Intro:
In our environment, AlwaysOn Availability Groups run on Windows Server 2016 and SQL Server 2016. We came across an interesting problem while the Windows team was applying patches to the servers.
Issue:
I ran into an interesting situation while operating system patching was in progress and we were handling AG failover and failback. After a network disruption, the AG dashboard no longer showed one of the nodes as healthy. The SQL Server instance was still running, but its availability groups were in the Resolving state. I checked WSFC right away and saw that one of the nodes was down and reporting an error. I tried to bring the cluster node back online in the usual way, but that did not work. A quick review of the cluster event log turned up the following error:
WSFC also reported the node’s status as Quarantined. This is a fascinating subject!
In my case the node would not rejoin the cluster automatically until 02:03:26, so what would happen to my availability group in the meantime? As seen in WSFC, a quarantined cluster node means a disconnected availability replica and, with it, a synchronization problem.
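The quarantine can also be confirmed from PowerShell instead of Failover Cluster Manager. The lines below are a minimal sketch, assuming the FailoverClusters module is installed and the session runs on a cluster node with sufficient rights. The property names (StatusInformation, QuarantineThreshold, QuarantineDuration) are as I understand them for Windows Server 2016, so verify them in your own environment.

```powershell
Import-Module FailoverClusters

# List every node with its state. On Windows Server 2016 a quarantined node
# is expected to show StatusInformation = Quarantined (assumption: verify on your build).
Get-ClusterNode | Select-Object Name, State, StatusInformation

# Cluster-wide quarantine settings: how many failures trigger quarantine,
# and how long (in seconds) a quarantined node is kept out of the cluster.
Get-Cluster | Select-Object Name, QuarantineThreshold, QuarantineDuration
```

The defaults are commonly documented as 3 failures within an hour and 7200 seconds (two hours), but check the values your cluster actually reports.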
The message comes with error number 10060. Error 10060, raised here while receiving data, means that a connection attempt failed because the connected host did not respond within a certain amount of time, or that an established connection failed because the connected host stopped responding. The SQL Server error log makes no explicit mention of the quarantine status. I found the following sample message in the SQL Server error log on the secondary replica:
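Because 10060 is a connection timeout, a quick reachability check between the replicas can help rule the network in or out. This is only a sketch: SQLNODE2 is a hypothetical partner replica name, and 5022 is merely the common default port for the database mirroring endpoint, so substitute the port your endpoint really listens on.

```powershell
# Hypothetical values: replace with your partner replica and its endpoint port.
$partner = 'SQLNODE2'
$endpointPort = 5022

# Attempt a TCP connection from this replica to the partner's endpoint.
$result = Test-NetConnection -ComputerName $partner -Port $endpointPort

# TcpTestSucceeded is True when the TCP handshake completed.
$result | Select-Object ComputerName, RemotePort, TcpTestSucceeded
```

A failed test here points at the network or firewall rather than at SQL Server itself. A successful test does not prove the replica is healthy, only that the port is reachable.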
Conclusion:
SQL Server is waiting for the cluster node to start in WSFC and regain its membership. In short, the availability health state won’t change until the quarantined node is back up and running in the cluster. While quarantined, the node does not rejoin the cluster automatically, which can be advantageous until the underlying problem on the affected cluster node is fixed. Thankfully, the Microsoft whitepaper indicates that we don’t need to wait for the quarantine period to end.
Use the PowerShell command below to clear the node’s quarantined state:
Start-ClusterNode -ClearQuarantine
After running the command above in PowerShell, we are good to go! A fresh check of the cluster’s health status shows that the node has returned to its previous state.
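If you prefer to clear the quarantine from another cluster node and confirm the result in the same session, something like the following could work. It is a sketch under assumptions: SQLNODE2 is a hypothetical name for the quarantined node, and the FailoverClusters module is available where you run it.

```powershell
Import-Module FailoverClusters

# Hypothetical name of the quarantined node: replace with your own.
$node = 'SQLNODE2'

# Clear the quarantine and start the cluster service on that node.
Start-ClusterNode -Name $node -ClearQuarantine

# Confirm the result: State should read Up once the node has rejoined.
Get-ClusterNode -Name $node | Select-Object Name, State, StatusInformation
```

Once the node is back, give the AG dashboard a moment to refresh before judging the replica's synchronization state.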