We recently completed a side-by-side migration of 30 databases from a standalone SQL Server 2008 R2 instance to SQL Server 2016, building new Windows Server 2016 machines and configuring Always On (two replicas) with multiple Availability Groups.
The application team updated their connection strings to use the listener name and verified the application’s functionality. They confirmed that everything was working as intended.
After one week, I received a call from the application team (in the middle of the night) reporting that they were experiencing database connectivity issues.
When I signed in and checked the database server, the Always On dashboard on the primary replica showed HEALTHY and green: all databases were synchronized and online.
Using sys.sysprocesses, I checked whether any connections were reaching the databases, and I could see that application connections were in fact being made to the database server.
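A minimal check along those lines (the database name ‘DAB1’ is taken from this environment and is a placeholder for your own; on newer builds, sys.dm_exec_sessions is the preferred DMV for the same information):

```sql
-- List current connections to a given availability database.
-- Replace 'DAB1' with the database you are checking.
SELECT spid,
       loginame,
       hostname,
       program_name,
       DB_NAME(dbid) AS database_name,
       status,
       last_batch
FROM sys.sysprocesses
WHERE dbid = DB_ID('DAB1');
```

If application host names and program names appear in the output, the app is reaching the instance, and the problem is more likely intermittent (for example, a brief failover or lease event) than a plain connectivity failure.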
As a result, I informed the application team, who responded that they had restarted their application services and that database connectivity was now working properly.
I checked the SQL Server Error Log and found the following messages:
SQL Server hosting availability group ‘XXX’ did not receive a process event signal from the Windows Server Failover Cluster within the lease timeout period.
Error: 19421, Severity: 16, State: 1.
The lease between availability group ‘AVG1’ and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
Error: 19407, Severity: 16, State: 1.
The state of the local availability replica in availability group ‘AVG1’ has changed from ‘PRIMARY NORMAL’ to ‘RESOLVING NORMAL’. The state changed because the lease between the local availability replica and Windows Server Failover Clustering (WSFC) has expired. For more information, see the SQL Server error log, the Windows Server Failover Clustering (WSFC) management console, or the WSFC log.
Always On Availability Groups connection with secondary database terminated for database ‘DAB1’ on availability replica ‘Nod1’ with Replica ID: 95d2878a-3fe0-48bf-aa2a-a41805fda786. This is an informational message only. No user action is required.
Remote harden of transaction ‘user transaction’ (ID 0x000000002af6b5b4 0001:99ea0a50) in database ‘XXX’ at LSN (7544210:90297:16) initiated at Apr 17 2021 1:46AM failed.
Stopped listening on virtual network name ‘LISTN1’. No user action is required.
Process ID 250 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 16, object_id = 0.
Always On: Changing the state of the Windows Server Failover Clustering (WSFC) resource associated with availability group ‘AVG1’ to Failed. The resource state does not match the availability group state in the instance of SQL Server. The WSFC resource state indicates that the local availability replica is the primary replica, but it is not in the primary role. This is an informational message only. No user action is required.
Always On: The local replica of availability group ‘AVG1’ is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.
Unable to access availability database ‘DAB1’ because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database are permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.
A connection for availability group ‘AVG1’ from availability replica ‘Nod1’ with id [3DB08082-C6D9-47E6-B53A-CCC8B15C21D2] to ‘Nod2’ with id [95D2878A-3FE0-48BF-AA2A-A41805FDA786] has been successfully established. This is an informational message only. No user action is required.
According to the SQL Server error log, the problem was caused by a lease timeout on the primary replica.
Under intense resource pressure on the database server (100 percent CPU usage), SQL Server can become stalled or unresponsive for a period of time. When that happens, SQL Server cannot renew the lease with the cluster in time, which causes the lease timeout or lease expiration.
Collecting the required Data:
Set up a Perfmon data collector for CPU usage as described below. After you have created the data collector and started it, let it run until the lease timeout error reappears. Then stop the data collector, open the log in Performance Monitor, and check the Processor \ % Processor Time counter to see whether CPU utilization was high at the time of the lease timeout recorded in the error log.
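If Perfmon was not collecting data when the incident occurred, SQL Server itself keeps a short rolling history of CPU utilization in the scheduler monitor ring buffer. This is a sketch of the commonly used query (the XML paths assume the standard RING_BUFFER_SCHEDULER_MONITOR record layout):

```sql
-- Approximate recent CPU history from the scheduler monitor ring buffer.
-- sql_cpu_pct is CPU used by this SQL Server instance; 100 - system_idle_pct
-- - sql_cpu_pct approximates CPU used by other processes.
SELECT DATEADD(ms, rb.[timestamp] - si.ms_ticks, GETDATE()) AS event_time,
       rb.record.value('(Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_pct,
       rb.record.value('(Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')         AS system_idle_pct
FROM (SELECT [timestamp], CAST(record AS xml) AS record
      FROM sys.dm_os_ring_buffers
      WHERE ring_buffer_type = 'RING_BUFFER_SCHEDULER_MONITOR') AS rb
CROSS JOIN sys.dm_os_sys_info AS si
ORDER BY event_time DESC;
```

This only covers the last few hours at roughly one sample per minute, so it complements, rather than replaces, the Perfmon data collector.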
Looking at the SQL Server Agent jobs, I can see that an Optimization job was triggered at that time; a few minutes after the job started executing, the lease timeout occurred, and the optimization operation did not complete successfully.
From this, I can say that when the Optimization job started on schedule, CPU utilization spiked to 100% and blocking occurred on the database server (as seen by the SPID killed by SQL Server in the error log message), resulting in the lease timeout / lease expired condition.
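To correlate the job’s run times with the lease-timeout errors, msdb’s job history can be queried. A sketch, where the job name filter ‘Optimization’ is a placeholder for the actual job name, and msdb.dbo.agent_datetime is an undocumented but widely used helper that combines run_date and run_time:

```sql
-- When did the maintenance job run, and did it succeed?
SELECT j.name,
       msdb.dbo.agent_datetime(h.run_date, h.run_time) AS run_started,
       h.run_duration,   -- HHMMSS format
       h.run_status      -- 0 = failed, 1 = succeeded, 3 = canceled
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
  ON j.job_id = h.job_id
WHERE j.name LIKE '%Optimization%'
  AND h.step_id = 0      -- job outcome rows only, not individual steps
ORDER BY run_started DESC;
```

A run that started a few minutes before the lease timeout and ended with run_status = 0 supports the conclusion that the job triggered the CPU spike.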
The lease timeout defaults to 20000 ms (20 seconds); if the problem persists, the value can be increased from the Windows Server Failover Cluster (WSFC).
Open ‘cluadmin.msc’ (Failover Cluster Manager) –> click Roles in the left pane –> in the right pane, select the availability group role –> at the bottom, click the Resources tab –> right-click the availability group resource –> click Properties –> on the Properties tab, change the Lease Timeout value –> then click OK.
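The same change can be viewed and made from PowerShell on a cluster node using the FailoverClusters module; the resource name ‘AVG1’ and the new value below are placeholders for your environment:

```powershell
# Run on a WSFC node with the FailoverClusters module available.
Import-Module FailoverClusters

# Inspect the current lease timeout (in milliseconds) on the AG resource:
Get-ClusterResource -Name "AVG1" | Get-ClusterParameter LeaseTimeout

# Raise it, for example to 40000 ms (40 seconds):
Get-ClusterResource -Name "AVG1" | Set-ClusterParameter LeaseTimeout 40000
```

Note that the LeaseTimeout parameter lives on the availability group cluster resource itself, not on the listener resource.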
However, adjusting the lease timeout will not solve the underlying problem; it is only a band-aid. The root cause of the blocking needs to be identified. In addition, try rescheduling the Optimization job to a quieter window and observe the results.