Recently in our environment we saw a weird issue on Windows Server 2016 with SQL Server 2016 configured with Always On Availability Groups.
I noticed an interesting scenario while operating system patching was going on and we were doing an AG failover and failback. After simulating some network outage scenarios, I was not able to see the AG dashboard healthy for one of the nodes. SQL Server was up and running fine, but the Availability Groups were in the Resolving state. I immediately checked WSFC and saw that one of the nodes was showing an error and was not up. So I tried to bring my cluster node back online the traditional way, but it didn't come up. A quick look at the cluster event log led me to the error message shown below:
Also, in WSFC the node status was showing as Quarantined. Interesting issue!
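The same state can also be checked from PowerShell instead of the Failover Cluster Manager console. The following is only a minimal sketch, assuming the FailoverClusters module is installed on the machine where you run it and that your nodes expose the `StatusInformation` property (it was introduced with Windows Server 2016). It lists the node states and the cluster-wide quarantine settings:

```powershell
# Assumption: the FailoverClusters module is available
# (it ships with the Failover Clustering feature / RSAT tools).
Import-Module FailoverClusters

# List every node with its state. On Windows Server 2016 the
# StatusInformation column is expected to read "Quarantined"
# for a node that the cluster has put in quarantine.
Get-ClusterNode | Format-Table Name, State, StatusInformation -AutoSize

# Show the cluster-wide quarantine settings:
#   QuarantineThreshold - how many failures a node may have before it is quarantined
#   QuarantineDuration  - how long (in seconds) the node is kept from rejoining
Get-Cluster | Format-List Name, QuarantineThreshold, QuarantineDuration
```

If the settings were never changed, I would expect `QuarantineDuration` to be 7200 seconds (two hours), which is why the node is kept out of the cluster for so long.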
Here, the node will not automatically rejoin the cluster until 02:03:26 in my case. And what happens to my availability group? No surprise here: a quarantined cluster node means a disconnected availability replica and a synchronization issue, as shown in the screenshot below:
We notice the corresponding error number 10060 with the message: An error occurred while receiving data: '10060(A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)'. There is no specific message in the SQL Server error log about the quarantine state. From the secondary replica, I got the following sample message in the SQL Server error log:
SQL Server is waiting for the cluster node to start and rejoin the WSFC. In short, as long as the node stays quarantined, the availability group health state will not change. As per the error message, the node will rejoin the cluster automatically once the quarantine period ends, which can be a good thing while you have not yet fixed the underlying issue on the affected cluster node. Fortunately, as stated in the Microsoft documentation, we do not have to wait for the quarantine period to finish.
We can use the following PowerShell command to bring the node out of the quarantined state:
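As a minimal sketch, assuming the command in question is `Start-ClusterNode` with its `-ClearQuarantine` switch (available from Windows Server 2016), it would look roughly like this. The node name `SQLNODE2` is a placeholder and not the name from my environment. Only run it once the underlying network or node issue is fixed, otherwise the node may simply be quarantined again:

```powershell
# Assumption: run from an elevated PowerShell session on a cluster node,
# with the FailoverClusters module available.
Import-Module FailoverClusters

# "SQLNODE2" is a hypothetical node name; replace it with the quarantined node.
$nodeName = "SQLNODE2"

# Clear the quarantine flag and start the cluster service on that node,
# so it can rejoin without waiting for the quarantine period to expire.
Start-ClusterNode -Name $nodeName -ClearQuarantine

# Confirm that the node is back up and no longer reported as quarantined.
Get-ClusterNode -Name $nodeName | Format-Table Name, State, StatusInformation -AutoSize
```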
Let's go! Checking the cluster health state again, the node was back to normal.