Jump to content

Admin

  • entries
    124
  • comments
    78
  • views
    34817

Recommended Windows Hotfix for Database Availability Groups running Windows Server 2008 R2


i-away

757 views

 Share

In early August of this year, the Windows SE team released the
following Knowledge Base (KB) article and accompanying software hotfix
regarding an issue in Windows Server 2008 R2 failover clusters:


KB2550886 - A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working


This hotfix is strongly recommended for all databases availability
groups that are stretched across multiple datacenters. For DAGs that are
not stretched across multiple datacenters, this hotfix is good to have,
as well. The article describes a race condition and cluster database
deadlock issue that can occur when a Windows Failover cluster encounters
a transient communication failure. There is a race condition within the
reconnection logic of cluster nodes that manifests itself when the
cluster has communication failures. When this occurs, it will cause the
cluster database to hang, resulting in quorum loss in the failover
cluster.


As described on TechNet,
a database availability group (DAG) relies on specific cluster
functionality, including the cluster database. In order for a DAG to be
able to operate and provide high availability, the cluster and the
cluster database must also be operating properly.


Microsoft has encountered scenarios in which a transient network
failure occurs (a failure of network communications for about 60
seconds) and as a result, the entire cluster is deadlocked and all
databases are within the DAG are dismounted. Since it is not very easy
to determine which cluster node is actually deadlocked, if a failover
cluster deadlocks as a result of the reconnect logic race, the only
available course of action is to restart all members within the entire
cluster to resolve the deadlock condition.


The problem typically manifests itself in the form of cluster quorum
loss due to an asymmetric communication failure (when two nodes cannot
communicate with each other but can still communicate with other nodes).
If there are delays among other nodes in the receiving of cluster
regroup messages from the cluster’s Global Update Manager (GUM), regroup
messages can end up being received in unexpected order. When that
happens, the cluster loses quorum instead of invoking the expected
behavior, which is to remove one of the nodes that experienced the
initial communication failure from the cluster.


Generally, this bug manifests when there is asymmetric latency (for
example, where half of the DAG members have latency of 1 ms, while the
other half of the DAG members have 30 ms latency) for two cluster nodes
that discover a broken connection between the pair. If the first node
detects a connection loss well before the second node, a race condition
can occur:


  • The first node will initiate a reconnect of the stream between the
    two nodes. This will cause the second node to add the new stream to its
    data.
  • Adding the new stream tears down the old stream and sets its failure
    handler to ignore. In the failure case, the old stream is the failed
    stream that has not been detected yet.
  • When the connection break is detected on the second node, the second
    node will initiate a reconnect sequence of its own. If the connection
    break is detected in the proper race window, the failed stream's failure
    handler will be set to ignore, and the reconnect process will not
    initiate a reconnect. It will, however, issue a pause for the send
    queue, which stops messages from being sent between the nodes. When the
    messages are stopped, this prevents GUM from operating correctly and
    forces a cluster restart.

If this issue does occur, the consequences are very bad for DAGs. As a
result, we recommend that you deploy this hotfix to all of your Mailbox
servers that are members of a DAG, especially if the DAG is stretched
across datacenters. This hotfix can also benefit environments running
Exchange 2007 Single Copy Clusters and Cluster Continuous Replication
environments.


In addition to fixing the issue described above, KB2550886 also
includes other important Windows Server 2008 R2 hotfixes that are also
recommended for DAGs:


 Share

0 Comments


Recommended Comments

There are no comments to display.

Guest
Add a comment...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
×
×
  • Create New...