Microsoft Exchange 2016 DAG Failure and Recovery


We run a number of Microsoft Exchange 2016 servers (yes, I know we need to upgrade). These consist of four two-node DAGs with a pair of shared witness servers. For the purposes of this explanation, however, we have a two-node Microsoft Exchange 2016 DAG with a single DAG Witness Server (file share).
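As a point of reference, you can check the DAG membership and witness configuration at any time. A quick sketch using the DAG name from this article (DAG1):

# Show the DAG members, the witness server and whether the witness share is in use.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | Format-List Name, Servers, WitnessServer, WitnessShareInUse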

Situation

We had an incident where unfortunately one of the nodes AND the Witness Server for that DAG both went offline at the same time.

The result was that the DAG cluster became unavailable, which took the mailbox databases on the surviving node offline as well.

Now What?

To recover the situation we were a bit stuck: we were unable to bring the failed Microsoft Exchange 2016 server back up (for a period of time); however, we were able to bring the Witness Server for that DAG cluster back up.

We’d assumed that would be enough. With a Majority Node Set cluster, whoever is acting as the master just needs a majority, i.e. the ability to reach a majority of the nodes in the cluster, in this case two of three. So with the B side Exchange server (one vote) and the Witness Server (one vote), that should have been enough; however, due to the way the cluster went offline, this did not work.
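If you want to see the vote count and quorum configuration for yourself, the cluster can be inspected from the surviving node. This is a minimal sketch, assuming the FailoverClusters PowerShell module is available on the DAG member (it comes with the failover clustering feature that the DAG relies on):

# Run on the surviving DAG member in an elevated PowerShell session.
Import-Module FailoverClusters

# List the cluster nodes, their state and their votes.
Get-ClusterNode | Format-Table Name, State, NodeWeight

# Show the quorum configuration, including the file share witness.
Get-ClusterQuorum | Format-List *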

Recovery

What was required was to force start the cluster on the B side Exchange Server (the survivor). We did this with the following commands:

Stop-Service ClusSvc
Net start ClusSvc /forcequorum

Once the cluster service was restarted, we could then start the DAG (Database Availability Group) and specify that the B side Exchange Server is the mailbox server, i.e. the Primary Active Manager.

Start-DatabaseAvailabilityGroup -Identity DAG1 -MailboxServer dag1-server-b

Once started, we checked for the Primary Active Manager with:

Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | FL Name, PrimaryActiveManager

In our case this took around 2-3 minutes to appear. If your result is blank, the DAG has no active manager, and that will explain your problems.

[PS] C:\Windows\system32>Get-DatabaseAvailabilityGroup -identity DAG1 -status | fl name, PrimaryActiveManager
WARNING: Unable to get Primary Active Manager information due to an Active Manager call failure. Error: An Active
Manager operation failed. Error: The Microsoft Exchange Replication service may not be running on server
dag1-server-a.domain.com. Specific RPC error message: Error 0x6ba (The RPC server is unavailable) from
cli_GetPrimaryActiveManager [Server: dag1-server-a.domain.com]


Name                 : DAG1
PrimaryActiveManager :

Once assigned:

[PS] C:\Windows\system32>Get-DatabaseAvailabilityGroup -identity DAG1 -status | fl name, PrimaryActiveManager


Name                 : DAG1
PrimaryActiveManager : DAG1-SERVER-B

Close, But No Cigar

We were then able to start all the databases that normally resided on the B side Exchange Server; however, we were unable to get the databases from the A side Exchange Server to start, these just showed as “Unknown”.

[PS] C:\Windows\system32>Get-DatabaseAvailabilityGroup -identity DAG1 -status

Name             Member Servers                           Operational Servers
----             --------------                           -------------------
DAG1             {DAG1-SERVER-A, DAG1-SERVER-B}           {DAG1-SERVER-B}
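To see exactly which copies are affected, you can list the copy status on the surviving server. A minimal sketch (server name from our environment, adjust to yours):

# List every database copy on the surviving node with its status and copy queue length.
Get-MailboxDatabaseCopyStatus -Server dag1-server-b | Format-Table Name, Status, CopyQueueLength, ContentIndexState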

Upon attempting to mount the databases that are normally on the A side Exchange Server, we got errors such as:

An Active Manager operation failed. Error: The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error: Database copy 'Database1' on server 'dag1-server-a.domain.com' has a copy queue length of 9223372036854725486 logs, which is too high to enable automatic recovery. You can use the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks and -MountDialOverride parameters to move the database with loss. If the database isn't mounted after successfully running Move-ActiveMailboxDatabase, use the Mount-Database cmdlet to mount the database.

Don’t worry too much about that massive number; you’ll see a number starting with “922337…” but yours will be slightly different. It effectively means Exchange cannot determine the real copy queue length because the source server is unreachable. You can check the copy queue length across all copies with:

Get-MailboxServer | Get-MailboxDatabaseCopyStatus | Sort-Object CopyQueueLength

Essentially what this means is that Exchange thinks the database is so far out of date it’s not going to bring it live (the local copy of the database, i.e. the copy replicated from Server A to Server B). What we need to do here is force it live.

WARNING! Depending on the state of your replica copy within the DAG at the time of the failure, you may suffer some data loss by bringing live the replica database on the B side Exchange server. This is your choice to make.

Bring each database live in turn with:

Move-ActiveMailboxDatabase database1 -ActivateOnServer dag1-server-b -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT

You’ll see something like:

[PS] C:\Windows\system32>Move-ActiveMailboxDatabase database1 -ActivateOnServer dag1-server-b -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT

Confirm
Moving mailbox database "database1" from server "dag1-server-a.domain.com" to server
"dag1-server-b.domain.com".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                tart            nd                                            Objective     AtMoveStart AtMoveEnd
--------        --------------- --------------- ------     ----------------   ------------- ----------- -----------
database1         dag1-server-a dag1-server-b Succeeded  1                  04/09/2023    Dismounted  Mounted

Repeat on each database until they are all mounted. At this point all the A side Exchange Server’s databases should be mounted on the B side Exchange Server; essentially all that DAG’s databases are now mounted and running from the B side Exchange Server (i.e. the one that survived).
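If you have a lot of databases, a small loop can do the repetition for you. This is a sketch only, assuming every database should end up active on dag1-server-b; review each result rather than trusting it blindly:

# Find every database copy on the surviving server that is not currently mounted.
$pending = Get-MailboxDatabaseCopyStatus -Server dag1-server-b | Where-Object { $_.Status -ne 'Mounted' }

foreach ($copy in $pending) {
    # Force each one live on the surviving server, accepting possible data loss.
    Move-ActiveMailboxDatabase -Identity $copy.DatabaseName -ActivateOnServer dag1-server-b -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BestEffort -Confirm:$false
}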

Optional Clean-Up

You may need to clean up, but this you’ll have to judge at the time. Ideally you’ll get the failed Exchange Server (in this case the A side Exchange Server) back online; but if it is permanently gone, you’ll need to remove its database copies, then work out a way to move mailboxes and bring everything back to a good state at a later time. Repeat the following for all affected databases (on the failed node):

Remove-MailboxDatabaseCopy -Identity database1\dag1-server-a -Confirm:$False
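If the failed server hosted copies of many databases, the removal can be scripted. A rough sketch, assuming the A side server is gone for good and that Get-MailboxDatabase -Server still lists the copies configured on it; double-check the list before removing anything:

# Find every database that still has a copy configured on the failed server and remove that copy.
Get-MailboxDatabase -Server dag1-server-a | ForEach-Object {
    Remove-MailboxDatabaseCopy -Identity "$($_.Name)\dag1-server-a" -Confirm:$False
}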

While We Recover

So now we have the mailbox databases back online, we can move on to making sure it is all safe. Depending on your situation this will differ, but in our case the A side Exchange Server had suffered a power failure, so once the power was restored we could just fire it back up. However, just to be sure, we set the “ActivationPreference” for its databases to the B side Exchange Server, at least until the problem had passed.

Set-MailboxDatabaseCopy -Identity database1\dag1-server-b -ActivationPreference 1
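To apply that to every database rather than one at a time, something like the following works. Again a sketch, assuming all copies on dag1-server-b should be preferred while the A side is suspect:

# Make the B side copy the preferred (ActivationPreference 1) copy for every database it hosts.
Get-MailboxDatabase -Server dag1-server-b | ForEach-Object {
    Set-MailboxDatabaseCopy -Identity "$($_.Name)\dag1-server-b" -ActivationPreference 1
}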

Of course, you’ll need to return this to normal to ensure both your A and B side Exchange Servers are running balanced; it’s the same command, just set the appropriate preference for each server as you see fit.

Last Steps

The final step was to power on the failed A side Exchange Server. Once powered on, it rejoined the cluster (it could now see both other members) and, as it didn’t hold the majority, stayed as the standby member.

We then needed to ensure the database copies were all up to date and in sync, then fail over the relevant databases to restore the original 50/50 split of databases.
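Checking copy health and moving databases back uses the same cmdlets as above. A sketch, assuming database1 is one of the databases that normally lives on the A side:

# Check that the copies on the recovered A side server are healthy and caught up.
Get-MailboxDatabaseCopyStatus -Server dag1-server-a | Format-Table Name, Status, CopyQueueLength, ReplayQueueLength

# Move a database back to its normal home once its copy is healthy.
Move-ActiveMailboxDatabase database1 -ActivateOnServer dag1-server-a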

And we reset the “ActivationPreference” setting from the previous section according to where we wanted each database to typically live.

Conclusion

As is common, my articles here are built on reading others’ articles and merging the solutions together for my particular situation. Please see the links below; they were helpful for me to piece together a solution and may also help you if your specific situation differs from mine!

Additional Information
