1. Quarantined nodes in Windows Failover Clusters:
I’ve observed this issue on Windows Server 2016, where one of the nodes was quarantined after repeated node failover attempts within an hour. For the following few hours, WSFC prevents the node from rejoining the cluster. This could be due to a network issue that I’ve noticed in the current environment: the quarantined node answers pings from one node, but pings from another node to the quarantined node intermittently time out.
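To see how the cluster membership looks from SQL Server's side, a query along these lines can help. It is only a sketch: it assumes you run it on a replica that is still online, and as far as I know this view reports a member as UP or DOWN rather than showing the quarantine status itself.

```sql
-- Run on any replica that is still online.
-- Shows each cluster member as seen by this SQL Server instance.
SELECT member_name,
       member_type_desc,        -- cluster node or witness
       member_state_desc,       -- UP or DOWN
       number_of_quorum_votes
FROM sys.dm_hadr_cluster_members
ORDER BY member_name;
```

A node that stays DOWN here while it still answers pings from some of the other nodes fits the intermittent network problem described above.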
For more details, see the link below; it helped me a lot:
2. You will encounter the error: “The connection to the primary replica is not active. The command cannot be processed.”
This happens when the database mirroring endpoint is not listening on the correct port, or when the TCP endpoint has been stopped for some reason.
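A minimal sketch for checking the endpoint on each replica follows. The endpoint name `Hadr_endpoint` in the commented-out statement is an assumption (it is the name the wizard usually creates); substitute whatever name the first query returns.

```sql
-- Is the database mirroring endpoint started, and on which port does it listen?
SELECT dme.name,
       dme.state_desc,          -- should be STARTED
       dme.role_desc,
       te.port,
       te.ip_address
FROM sys.database_mirroring_endpoints AS dme
JOIN sys.tcp_endpoints AS te
  ON te.endpoint_id = dme.endpoint_id;

-- Compare the port above with the endpoint URL configured for each replica.
SELECT replica_server_name,
       endpoint_url
FROM sys.availability_replicas;

-- If the endpoint is stopped, start it again (endpoint name is an assumption):
-- ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED;
```

If the port in `endpoint_url` does not match the port the endpoint listens on, the replicas cannot connect to each other even though the endpoint is started.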
See the link below for more:
3. An availability group is unexpectedly missing, dropped, or removed.
This was caused by SQL Server losing communication with WSFC. SQL Server then deletes the availability group.
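To confirm whether the availability group still exists on an instance, and how that instance currently sees the cluster, something like the following can be run. This is only a diagnostic sketch, not a fix.

```sql
-- How does this instance see the cluster and its quorum?
SELECT cluster_name,
       quorum_type_desc,
       quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Which availability groups does this instance still know about?
SELECT ag.name,
       ags.primary_replica,
       ags.synchronization_health_desc
FROM sys.availability_groups AS ag
LEFT JOIN sys.dm_hadr_availability_group_states AS ags
  ON ags.group_id = ag.group_id;
```

If the second query returns no row for the group, the group definition is gone from that instance.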
Please see the link below for more details:
4. Availability group is in the RESOLVING state:
Issue: A cluster failure lasting a few minutes affected the availability group, and the replicas went into a resolving state. Once the cluster was back online, the replicas returned to their usual primary and secondary roles, but several databases were still not synchronising. In addition, the databases on the primary were unavailable.
Resolution: The only remedy we found was to restart the SQL Server instance on the primary replica. However, even a simple restart of the SQL service through Configuration Manager got stuck on “stopping service.” We had to use the T-SQL command “SHUTDOWN WITH NOWAIT” to force SQL Server to stop. Once SQL Server was back up, the databases were available again and the AG was in sync and healthy.
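As a rough sketch, the state of each database can be checked on the primary before resorting to a forced shutdown. The shutdown statement is deliberately commented out, because it stops the instance immediately without checkpointing the databases.

```sql
-- Per-database synchronization state, as seen from the primary replica.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)         AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drs.database_state_desc          -- populated for local databases only
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id
ORDER BY ar.replica_server_name, database_name;

-- Last resort, when the service hangs on "stopping":
-- SHUTDOWN WITH NOWAIT;
```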
For further details, see the link below:
Issue: If an automatic failover fails, the secondary replica does not transition correctly to the primary role. As a result, the availability replica reports that it is in the RESOLVING state. In addition, the availability databases report that they are not synchronising, and applications cannot access them.
Possible causes:
Case 1: The value of “Maximum Failures in the Specified Period” has been reached.
By default, if the clustered resource fails three times within a six-hour period, the cluster leaves it in the failed state, and the AG replica remains in the RESOLVING state.
Case 2: The local SQL Server login NT AUTHORITY\SYSTEM has insufficient permissions. By default, this login is granted the following permissions:
ALTER ANY AVAILABILITY GROUP
VIEW SERVER STATE
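A minimal sketch for checking these permissions and restoring them is shown below. It assumes the NT AUTHORITY\SYSTEM login still exists on the instance; if the login itself was dropped, it has to be recreated first.

```sql
USE master;

-- List the server-level permissions currently held by NT AUTHORITY\SYSTEM.
SELECT pr.name AS login_name,
       pe.permission_name,
       pe.state_desc
FROM sys.server_principals AS pr
JOIN sys.server_permissions AS pe
  ON pe.grantee_principal_id = pr.principal_id
WHERE pr.name = N'NT AUTHORITY\SYSTEM';

-- Re-grant the two permissions listed above if they are missing.
GRANT ALTER ANY AVAILABILITY GROUP TO [NT AUTHORITY\SYSTEM];
GRANT VIEW SERVER STATE TO [NT AUTHORITY\SYSTEM];
```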
Case 3: If one of the availability databases in the availability group is in the SYNCHRONIZING or NOT SYNCHRONIZED state, automatic failover cannot transition the secondary replica into the primary role.
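Whether each database is ready for failover can be checked with a query like the following. It is a sketch meant to be run on the primary; a value of 0 in `is_failover_ready` means that database would block automatic failover to that replica.

```sql
-- Failover readiness per database and replica.
SELECT ar.replica_server_name,
       dcs.database_name,
       dcs.is_failover_ready,           -- 1 = synchronized and ready
       drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_cluster_states AS dcs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = dcs.replica_id
LEFT JOIN sys.dm_hadr_database_replica_states AS drs
  ON drs.replica_id = dcs.replica_id
 AND drs.group_database_id = dcs.group_database_id
ORDER BY ar.replica_server_name, dcs.database_name;
```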
For more details, see the link below:
5. Diagnose Unexpected Failover or Availability Group in RESOLVING State
Lease Timeout: A lease timeout can be triggered if SQL Server does not respond within the default 20-second lease timeout period.
Lease Timeout Cause, 100% CPU Utilization: A lease timeout can occur if CPU utilization stays extremely high for an extended period. Use Performance Monitor to keep an eye on CPU usage.
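Alongside Performance Monitor, CPU pressure inside SQL Server can be roughly gauged from the scheduler DMV. This is only a sketch: consistently high `runnable_tasks_count` values suggest tasks are waiting for CPU time, but it does not by itself prove that a lease timeout was caused by CPU.

```sql
-- Tasks waiting for CPU on each scheduler that runs user work.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,    -- tasks ready to run but waiting for CPU
       pending_disk_io_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'
ORDER BY runnable_tasks_count DESC;
```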
Sync Issues:
How do you troubleshoot an Always On synchronization issue?
A database's status can change to not synchronized for several reasons:
- Network Issue
- Huge transactions
- Space Issues
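The causes above can be narrowed down with a couple of queries. This is a sketch to be run on the primary: a growing log send queue hints at a network or large-transaction problem, a growing redo queue points at the secondary falling behind, and `log_reuse_wait_desc` shows what is keeping the transaction log from being truncated.

```sql
-- Queue sizes (in KB) and rates (in KB/sec) per database and replica.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id)     AS database_name,
       drs.synchronization_state_desc,
       drs.log_send_queue_size,     -- log not yet sent to the secondary
       drs.log_send_rate,
       drs.redo_queue_size,         -- log received but not yet redone
       drs.redo_rate,
       drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id
ORDER BY drs.log_send_queue_size DESC;

-- What is preventing log truncation (for example AVAILABILITY_REPLICA)?
SELECT name,
       log_reuse_wait_desc
FROM sys.databases
ORDER BY name;
```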