{"id":3917,"date":"2023-09-10T10:30:40","date_gmt":"2023-09-10T10:30:40","guid":{"rendered":"https:\/\/geekmungus.co.uk\/?p=3917"},"modified":"2023-09-10T10:30:40","modified_gmt":"2023-09-10T10:30:40","slug":"microsoft-exchange-2016-dag-failure-and-recovery","status":"publish","type":"post","link":"https:\/\/geekmungus.co.uk\/?p=3917","title":{"rendered":"Microsoft Exchange 2016 DAG Failure and Recovery"},"content":{"rendered":"\n<p>We run a number of Microsoft Exchange 2016 servers (yes, I know we need to upgrade), arranged as four two-node DAGs with a pair of shared witness servers. For the purposes of this explanation, however, we have a two-node Microsoft Exchange 2016 DAG with a single DAG Witness Server (File Share).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Situation<\/h2>\n\n\n\n<p>We had an incident where, unfortunately, one of the nodes AND the Witness Server for that DAG both went offline at the same time.<\/p>\n\n\n\n<p>The result was that the DAG cluster became unavailable, including the mailbox databases on the surviving node.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Now What?<\/h2>\n\n\n\n<p>Recovering the situation left us a bit stuck: we were unable to bring the failed Microsoft Exchange 2016 server back up (for a period of time), but we were able to bring the Witness Server for that DAG cluster back up.<\/p>\n\n\n\n<p>We&#8217;d assumed that would be enough: with a Majority Node Set cluster, whoever is acting as the master just needs a majority, i.e. 
the ability to reach a majority of the voting members in the cluster, in this case 2 (of 3). With the B side Exchange server (one vote) and the Witness Server (one vote), that should have been enough; however, due to the way the cluster went offline, this did not work.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Recovery<\/h2>\n\n\n\n<p>What was required was to force-start the cluster on the B side Exchange Server (the survivor). We did this with the following commands:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Stop-Service ClusSvc\r\nnet start ClusSvc \/forcequorum<\/code><\/pre>\n\n\n\n<p>Once the cluster service was restarted, we could then start the DAG (Database Availability Group) and specify the B side Exchange Server as the mailbox server, i.e. the Primary Active Manager.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Start-DatabaseAvailabilityGroup -Identity DAG1 -MailboxServer dag1-server-b<\/code><\/pre>\n\n\n\n<p>Once started, we checked for the Primary Active Manager with:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | FL Name, PrimaryActiveManager<\/code><\/pre>\n\n\n\n<p>In our case this took around 2-3 minutes to appear. If your result is blank, the DAG has no Primary Active Manager, and that will explain your problems.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;PS] C:\\Windows\\system32>Get-DatabaseAvailabilityGroup -identity DAG1 -status | fl name, PrimaryActiveManager\r\nWARNING: Unable to get Primary Active Manager information due to an Active Manager call failure. Error: An Active\r\nManager operation failed. Error: The Microsoft Exchange Replication service may not be running on server\r\ndag1-server-a.domain.com. 
Specific RPC error message: Error 0x6ba (The RPC server is unavailable) from\r\ncli_GetPrimaryActiveManager &#91;Server: dag1-server-a.domain.com]\r\n\r\n\r\nName                 : DAG1\r\nPrimaryActiveManager :<\/code><\/pre>\n\n\n\n<p>Once assigned:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;PS] C:\\Windows\\system32>Get-DatabaseAvailabilityGroup -identity DAG1 -status | fl name, PrimaryActiveManager\r\n\r\nName                 : DAG1\r\nPrimaryActiveManager : DAG1-SERVER-B<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Close, But No Cigar<\/h2>\n\n\n\n<p>We were then able to start all the databases that normally resided on the B side Exchange Server; however, we were unable to get the databases from the A side Exchange Server to start. Their status just showed as &#8220;unknown&#8221;.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;PS] C:\\Windows\\system32>Get-DatabaseAvailabilityGroup -identity DAG1 -status\r\n\r\nName             Member Servers                           Operational Servers\r\n----             --------------                           -------------------\r\nDAG1             {DAG1-SERVER-A, DAG1-SERVER-B}           {DAG1-SERVER-B}<\/code><\/pre>\n\n\n\n<p>Upon attempting to mount the databases that are normally on the A side Exchange Server, we got errors such as:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>An Active Manager operation failed. Error The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error: Database copy \u2018Database1\u2019 on server \u2018dag-server-a.domain.com\u2019 has a copy queue length of 9223372036854725486 logs, which is too high to enable automatic recovery. You can use the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks and -MountDialOverride parameters to move the database with loss. 
If the database isn\u2019t mounted after successfully running Move-ActiveMailboxDatabase, use the Mount-Database cmdlet to mount the database.<\/code><\/pre>\n\n\n\n<p>Don&#8217;t worry too much about that massive number. You&#8217;ll see a number starting with &#8220;922337&#8230;&#8221;, but yours will be slightly different. It sits just below the largest 64-bit signed integer (9223372036854775807), so it is almost certainly a placeholder rather than a genuine count of outstanding logs. You can list the copy queue length of every database copy with:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Get-MailboxServer | Get-MailboxDatabaseCopyStatus | Sort-Object CopyQueueLength<\/code><\/pre>\n\n\n\n<p>Essentially, this means Exchange thinks the local copy of the database (i.e. the copy replicated from Server A to Server B) is so far out of date that it&#8217;s not going to bring it live. What we need to do here is force it live.<\/p>\n\n\n\n<p>WARNING! Depending on the state of your replica copy within the DAG at the time of the failure, you may suffer some data loss by bringing live the replica database on the B side Exchange server. This is your choice to make.<\/p>\n\n\n\n<p>Bring each database live in turn with:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Move-ActiveMailboxDatabase database1 -ActivateOnServer dag1-server-b -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT<\/code><\/pre>\n\n\n\n<p>You&#8217;ll see something like:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#91;PS] C:\\Windows\\system32>Move-ActiveMailboxDatabase database1 -ActivateOnServer dag1-server-b -skiphealthchecks -skipact\r\nivecopychecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT\r\n\r\nConfirm\r\nMoving mailbox database \"database1\" from server \"dag1-server-a.domain.com\" to server\r\n\"dag1-server-b.domain.com\".\r\n&#91;Y] Yes  &#91;A] Yes to All  &#91;N] No  &#91;L] No to All  &#91;?] 
Help (default is \"Y\"): y\r\n\r\nIdentity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus\r\n                tart            nd                                            Objective     AtMoveStart AtMoveEnd\r\n--------        --------------- --------------- ------     ----------------   ------------- ----------- -----------\r\ndatabase1       dag1-server-a   dag1-server-b   Succeeded  1                  04\/09\/2023    Dismounted  Mounted<\/code><\/pre>\n\n\n\n<p>Repeat on each database until they are all mounted. At this point all the A side Exchange Server&#8217;s databases should be mounted on the B side Exchange Server; essentially, all of that DAG&#8217;s databases are now mounted and running from the B side Exchange Server (i.e. the one that survived).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Optional Clean-Up<\/h3>\n\n\n\n<p>You may need to clean up, but you&#8217;ll have to judge this at the time. Ideally you&#8217;ll get the failed Exchange Server (in this case the A side Exchange Server) back online; if it is permanently gone, you&#8217;ll need to remove its database copies and then work out a way to move mailboxes and bring everything back to a good state at a later time. Repeat the following for all affected databases (on the failed node).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Remove-MailboxDatabaseCopy -Identity database1\\dag1-server-a -Confirm:$False<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">While We Recover<\/h2>\n\n\n\n<p>Now that the mailbox databases are back online, we can look at what needs doing to keep everything safe. This will differ depending on your situation, but in our case the A side Exchange Server had suffered a power failure, so once the power was restored we could just fire it back up. 
However, just to be sure, we set the &#8220;ActivationPreference&#8221; for its databases to favour the B side Exchange Server, at least until the problem had passed.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Set-MailboxDatabaseCopy -Identity database1\\dag1-server-b -ActivationPreference 1<\/code><\/pre>\n\n\n\n<p>Of course, you&#8217;ll need to return this to normal later to ensure your A and B side Exchange Servers are running balanced. It&#8217;s the same command; just set the appropriate preference for each server as you see fit.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Last Steps<\/h2>\n\n\n\n<p>The final step was to power on the failed A side Exchange Server. Once powered on, it rejoined the cluster, since it could see both other members, and, since it didn&#8217;t hold the majority, it stayed as the standby member.<\/p>\n\n\n\n<p>We then needed to ensure the database copies were all up to date and in sync, then fail over the relevant databases to restore the original 50\/50 split of databases.<\/p>\n\n\n\n<p>Finally, we reset the &#8220;ActivationPreference&#8221; setting described in the previous section according to where we wanted the databases to typically be.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>As is common, my articles here are built on reading others&#8217; articles and merging the solutions together for my particular situation. Please see the links below; they helped me piece together a solution and may also help you if your specific situation differs from mine!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Additional Information<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.systoolsgroup.com\/updates\/database-availability-group-must-have-quorum\/\">https:\/\/www.systoolsgroup.com\/updates\/database-availability-group-must-have-quorum\/<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/community.spiceworks.com\/how_to\/166108-solve-error-active-manager-is-in-an-unknown-state-on-exchange-server\">https:\/\/community.spiceworks.com\/how_to\/166108-solve-error-active-manager-is-in-an-unknown-state-on-exchange-server<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.nucleustechnologies.com\/blog\/exchange-2016-database-status-unknown-error\/\">https:\/\/www.nucleustechnologies.com\/blog\/exchange-2016-database-status-unknown-error\/<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.alitajran.com\/dag-activation-preference-behavior-change-in-exchange-2016-cu2-and-higher\/\">https:\/\/www.alitajran.com\/dag-activation-preference-behavior-change-in-exchange-2016-cu2-and-higher\/<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.techieshelp.com\/exchange-dag-node-failure-force-switchover-queus\/\">https:\/\/www.techieshelp.com\/exchange-dag-node-failure-force-switchover-queus\/<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/shanejacksonitpro.com\/2015\/10\/22\/microsoft-exchange-dag-database-copy-queue-length-9223372036854773269\/\">https:\/\/shanejacksonitpro.com\/2015\/10\/22\/microsoft-exchange-dag-database-copy-queue-length-9223372036854773269\/<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>We run a number of Microsoft Exchange 2016 servers (yes, I know we need to upgrade), arranged as four two-node DAGs with a pair of shared witness servers. For the purposes of this explanation, however, we have a two-node Microsoft Exchange 2016 DAG with a single DAG Witness Server (File Share). 
Situation &#8230; <a title=\"Microsoft Exchange 2016 DAG Failure and Recovery\" class=\"read-more\" href=\"https:\/\/geekmungus.co.uk\/?p=3917\" aria-label=\"Read more about Microsoft Exchange 2016 DAG Failure and Recovery\">Read more<\/a><\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[32],"tags":[],"class_list":["post-3917","post","type-post","status-publish","format-standard","hentry","category-microsoft-exchange"],"_links":{"self":[{"href":"https:\/\/geekmungus.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/3917","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/geekmungus.co.uk\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/geekmungus.co.uk\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/geekmungus.co.uk\/index.php?rest_route=\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/geekmungus.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3917"}],"version-history":[{"count":3,"href":"https:\/\/geekmungus.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/3917\/revisions"}],"predecessor-version":[{"id":3920,"href":"https:\/\/geekmungus.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/3917\/revisions\/3920"}],"wp:attachment":[{"href":"https:\/\/geekmungus.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3917"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/geekmungus.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3917"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/geekmungus.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3917"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}