Node Quarantine state Issue on WSFC 2016 with Availability groups( Win 2016 + SQL 2016 )

Recently in our environment we saw weird issue where we have Windows Server 2016 + SQL Server 2016 configured with AlwaysON Availability Groups.
I noticed an interesting scenario when operating System patching is going on and We are working on doing AG failover and failback. After simulating some network outage scenarios, I was not able to see the AG dashboard healthy for one of node .I see SQL is up and running good but the Availability Groups are in Resolving State , immediately i have checked in WSFC and saw one of the Node is showing Error and is not UP.So, tried to bring back my cluster node online immediately by using traditional way but it didn’t came up. A quick look at the cluster event log led me to notice some error message as shown below :

Also in WSFC the Node status showing as Quarantined…Interesting Issue!!

Here, the Node will not automatically join the cluster until 02:03:26 in my case and what will happen to my availability group? Well no surprise here, the quarantined cluster node means an availability replica disconnected and a synchronization issue as shown in below in Screen shot:

We notice a corresponding error number 100060 with the message An error occurred while receiving data: ‘10060(A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)’. There is no specific message from SQL Server error log about quarantine state. From the secondary replica, I got the following sample message into the SQL Server error log:

SQL Server is waiting for the cluster node to start and rejoin the WSFC. In short, overall the quarantined node is active so the availability health state will not change . As pet the error the node will join automatically to cluster , it may be a good thing until you don’t fix the related issue on the concerned cluster node. Fortunately, as stated by Microsoft document, this is not mandatory for us to wait for the quarantined period to finish.

We can use the following PowerShell command to come out of quarantined State of Node :

Start-ClusterNode -Clearquarantine

Lets go! Check the cluster health state now and the Node was got back to normal

Upgrade/migrate SQL Server 2014 Availability group(s) running on Windows Server 2012 R2 to SQL Server 2019 running on Windows Server 2016

What we are performing today ?

To upgrade/migrate (side-side) SQL Server 2014 Availability group(s) running on Windows Server 2012 R2 to SQL Server 2019 running on Win 2016 with the least amount of downtime.

My Current Environment Setup :

I have a 2 node Windows Server 2012 R2 Failover cluster hosting SQL Server 2014 Always On Availability Group with synchronous commit mode. I have a listener configured for my applications to connect. These 2 replicas are running SQL 2014 with Latest builds.

Replica(Node) Details are below :


Above are my two replicas(Nodes) which are running with WindowsServer2012R2+SQL 2014.

Our main Agend is
I have SQL 2014 AG running on Windows Server 2012R2 which needs to be upgraded/migrated to SQL Server 2019 running on Windows Server 2016 with a very minimal downtime and no configuration changes for the Application teams.

Note : Here I am not performing In-place upgrade.Which requires more downtime.So, let’s assume In-place upgrades are not allowed.

High Level Implementation Plan :

. Get ready with 2 new builds of Windows Server 2016 (W16SQL2019A & W16SQL2019B)

. Install SQL Server 2019 on both win 2016 servers

. Add W16SQL2019A and W16SQL2019B nodes to the same windows cluster

. Enable Alwayson HADR Feature in SQL 2019

. Add the 2 nodes as replicas in SQL Server Alwayson Availability Group

. Join the Databases .

. Check the Dashboard of AG and make sure all looks healthy and Green

. At the change window of final cutover date/time, failover SQL Server 2014 to SQL Server 2019 and remove the old replicas( SQL Server 2014) from AG.

. And in Windows Server 2012R2 EVICT both nodes from the Cluster

Important things be a taken into Consideration after Successful configuration

Adding Win2016 nodes to existing Win2012R2 Same cluster is Leveraging MIXED MODE. Please don’t run this mixed mode WSFC for a long period. Microsoft may not support if you stay in mixed mode for more than 4 weeks.This blog is only to perform rolling upgrades to make your systems are really highly available. Try to wrap up the entire process in a day or two and make sure to be done with it.

Sometimes after adding the new replicas the Databases synchronization will not start automatically. you need to Join each databases Manually to get it SYNCHRONIZED.

Change the failover mode to manual for all replicas just to make sure cluster has no control over failing over my AG automatically.

At the final cut over before Failover the AG make sure to change Availability Mode to SYNCHRONOUS COMMIT mode and then proceed with your failover of AG from W12SQL2014A( Which is a Primary Replica) to W16SQL2019A.

At this point, W16SQL2019A took over the primary role of all your databases participating in your AG have been upgraded to SQL 2019 and the other SQL 2019(W16SQL2019B ) Instance will also be in sync

But the two SQL 2014 Instances will be in unhealthy state, In fact those databases will become inaccessible at this time, since Logs cannot be shipped from higher(2019) to lower(2014) version.

That mean the Databases will not be SYNCHRONIZED on SQL 2014 Instances

I did a Cleanup by removing both SQL 2014 Instances from AG as replicas and EVICTED the W12SQL2014 Nodes from WSFC.