Exchange 2007: Always On

Written by: J. Peter Bruzzese

Exchange 2007 comes equipped to let you use several high availability techniques.

Most organizations need their data available every second of every day. Unfortunately, computers, networks and storage devices will all fail eventually — no matter how much we pay for them or how closely we monitor them.

Disaster recovery solutions are our most common defense against such technology failures. However, these only let you restore your data back to the point of disaster. Data and time are bound to be lost, and even if the time lost is minimal, time lost is still money lost.

The catch phrase these days for keeping systems running is "High Availability." The promise of high availability solutions is 24/7 uptime — or, more accurately, no unscheduled downtime. There are three primary options for high availability with Exchange Server 2007: Local Continuous Replication, Cluster Continuous Replication and Single Copy Clusters.

Understanding these options will give you a better vision of what you can provide your organization when you're deploying Exchange Server 2007 out of the box. These solutions all provide varying degrees of high availability, so not all solutions are equal, and not all solutions involve clustering. Clustering is often treated as synonymous with high availability, but it's no longer an essential ingredient.

First Line of Defense: Transaction Logs

Exchange makes a valiant attempt to provide its own redundancy right out of the box. Part of Exchange's overall architecture includes storage groups. Each of these storage groups can contain up to five databases. Exchange 2007 Standard Edition lets you create up to five storage groups and mount up to five databases. Exchange 2007 Enterprise Edition lets you create up to 50 storage groups and mount up to 50 databases.

When you install Exchange in the mailbox server role, you'll find one default storage group, which contains one default mailbox database (typically labeled Mailbox Database.edb, as database files use .EDB extensions). For each new database you add, you increase the number of .EDB files within a storage group. You could also create additional storage groups with additional databases.

Transaction logs help keep each database up-to-date. When data comes into the Exchange server, typically as an e-mail message, it enters the system memory. From memory, it's written to a transaction log. Each log reaches a maximum size of 1MB (a reduction from 5MB in Exchange 2003). These transaction logs are eventually committed to the database that stores the mailbox for the intended recipient.

A checkpoint file keeps track of which transaction logs have been committed to the database. The benefit here is redundancy, although it protects you only if you go to the effort of placing your logs and your database on separate disks.

This allows for better performance and proper disaster recovery. In the event that the database is corrupted or the disk carrying the database crashes, those transaction logs are invaluable. You can combine them with the latest backup to restore your system. Understanding how the transaction logs and the database work together is essential to understanding these high availability solutions for Exchange 2007.
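To make the relationship between logs, checkpoint and database concrete, here is a minimal Python sketch. It is an illustrative model only, not Exchange code: the StorageGroup class, the restore function and the three-record log capacity (standing in for the 1MB log size) are all assumptions made for the example.

```python
from dataclasses import dataclass, field

LOG_CAPACITY = 3  # records per log file; an assumed stand-in for the 1MB log size


@dataclass
class StorageGroup:
    database: dict = field(default_factory=dict)     # committed data (the .EDB file)
    closed_logs: list = field(default_factory=list)  # full log files, oldest first
    open_log: list = field(default_factory=list)     # log currently being written
    checkpoint: int = 0  # number of closed logs already committed to the database

    def write(self, key, value):
        """Record the change in a log first; the database is updated later."""
        self.open_log.append((key, value))
        if len(self.open_log) == LOG_CAPACITY:
            self.closed_logs.append(self.open_log)
            self.open_log = []

    def commit(self):
        """Apply closed logs not yet in the database, then advance the checkpoint."""
        for log in self.closed_logs[self.checkpoint:]:
            self.database.update(log)
        self.checkpoint = len(self.closed_logs)


def restore(backup_db, backup_checkpoint, surviving_logs):
    """Rebuild a database from a backup plus the logs written after it."""
    db = dict(backup_db)
    for log in surviving_logs[backup_checkpoint:]:
        db.update(log)
    return db


sg = StorageGroup()
for i in range(3):
    sg.write(f"msg{i}", "body")
sg.commit()
backup_db, backup_checkpoint = dict(sg.database), sg.checkpoint  # take a backup

for i in range(3, 9):
    sg.write(f"msg{i}", "body")  # logged, but never committed to the database

# The database disk fails here. The logs sit on a separate disk and survive.
recovered = restore(backup_db, backup_checkpoint, sg.closed_logs)
```

In this model the backup holds only the first three messages, yet the recovered database holds all nine, because the surviving logs fill in everything written after the backup. If the logs had shared a disk with the database, they would have been lost along with it.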

High Stakes, High Availability

As mentioned earlier, there are three primary high availability options beyond your ability to structure your database and transaction logs for improved performance and availability. Placing your database on a RAID 5 disk structure and mirroring your transaction logs is a recommended practice, but even that won't prevent database corruption.

If you want to go beyond the standard recommendations, consider something like Local Continuous Replication (LCR). LCR is a single-server solution that uses asynchronous log shipping and replay from one set of disks to another.

So what does asynchronous log shipping mean? When you first establish LCR (or Cluster Continuous Replication, CCR, which is covered later), it makes a copy of the database. Transaction logs keep that copy up-to-date from that point forward. Once a log is closed, it's shipped over to the second disk and replayed into the secondary copy of the database.

The caveat here is that this time lag means the secondary copy can't be 100 percent in sync with its primary. You therefore have the potential to lose some data, depending on when a failure occurs.
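The following Python sketch illustrates why asynchronous shipping leaves a window for data loss. It is a simplified model under stated assumptions, not how Exchange is implemented: the ReplicatedStorageGroup class and its methods are hypothetical, a log holds four messages rather than 1MB, and the primary is assumed to apply each message immediately.

```python
LOG_CAPACITY = 4  # messages per log file; an assumed stand-in for the 1MB log size


class ReplicatedStorageGroup:
    def __init__(self):
        self.active_db = []   # primary copy of the database
        self.passive_db = []  # secondary copy on the other set of disks
        self.open_log = []    # log still being written on the primary
        self.shipped = 0      # number of log files shipped so far

    def deliver(self, message):
        """The primary records the message now; shipping happens only later."""
        self.active_db.append(message)
        self.open_log.append(message)
        if len(self.open_log) == LOG_CAPACITY:
            self._ship_and_replay(self.open_log)
            self.open_log = []

    def _ship_and_replay(self, closed_log):
        """Copy a closed log to the second location and replay it there."""
        self.passive_db.extend(closed_log)
        self.shipped += 1

    def at_risk(self):
        """Messages that currently exist only on the primary."""
        return self.active_db[len(self.passive_db):]


group = ReplicatedStorageGroup()
for n in range(10):
    group.deliver(f"message-{n}")

lost_if_primary_fails_now = group.at_risk()
```

After ten messages, two full logs have been shipped and replayed, so the secondary copy holds eight messages. The last two still sit in the open log, and those are what you would lose if the primary failed at that moment.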

Although it's often called the "poor man's cluster," LCR isn't technically a cluster. For those of you familiar with mirror sets, the concept here is similar. You enable LCR on a per-storage-group basis.

You can create an LCR set when you create a new storage group. You can also create one for an existing storage group. You establish LCR through the GUI with the Exchange Management Console or through PowerShell commands from within the Exchange Management Shell. In the event that one disk crashes or the database is corrupted, you can switch over to the secondary copy of the data by performing a manual switchover. Keep in mind that this is an inexpensive solution that you can run on the Standard Edition of Windows Server 2003.

Clustering to the Rescue

CCR is a clustered solution that allows for two nodes in a cluster: one is the active node and the other is the passive node, ready for automatic failover. Both nodes must be servers with the Exchange 2007 Mailbox role installed.

The benefit here is that you eliminate single points of failure because there are two unique systems with two sets of storage. This offers a higher level of availability than an LCR set. The caveat here is that you'll need to invest more money in hardware (for the extra system) and software (because to perform the cluster you'll need to be running the Enterprise Edition of Windows Server 2003). This type of solution also uses asynchronous log shipping and replay to keep the database up-to-date between the active and passive copy of the data.

To fully understand the way CCR works, visualize the two servers. The active server has a network connection to the public network. The passive server does, as well. Between the two nodes, however, is a private network connection on a separate network-addressing scheme that carries the "heartbeat" signal between them.

The passive server waits patiently, as long as it receives a heartbeat from the active node saying "I'm alive." It then responds back that it, too, is alive. You can configure the cycle of these heartbeats, but by default they're sent every 1.2 seconds from each cluster node.

If the passive server doesn't receive a heartbeat (which could happen for any number of reasons), it starts getting edgy and eager to become active. If, however, it did become active while the other server was also active, it could cause a problem known as split-brain syndrome. To prevent this problem, the cluster uses a quorum (called a Majority Node Set, or MNS, quorum) with a file share witness that arbitrates between these two servers.

The file share witness is held on a third server (typically a Hub Transport server in the same Active Directory site as the active and passive nodes), and it helps the passive node make the final determination of whether the active node is indeed alive and well. In the event that the active server is actually down, the passive server will automatically come to life and assume the workload.
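The tie-breaking role of the witness can be sketched as a simple majority vote. The Python below is a hypothetical model, not the Windows cluster service: the FileShareWitness class, the has_quorum function and the node names are invented for illustration, and the model assumes three votes in total (two nodes plus the witness) with only one node able to hold the witness at a time.

```python
from typing import Optional

TOTAL_VOTES = 3  # two cluster nodes plus the file share witness


class FileShareWitness:
    """Hypothetical stand-in for the witness share; one node may hold it at a time."""

    def __init__(self):
        self.holder: Optional[str] = None

    def try_claim(self, node: str) -> bool:
        if self.holder in (None, node):
            self.holder = node
            return True
        return False


def has_quorum(node: str, hears_peer: bool, witness: FileShareWitness) -> bool:
    """A node may be active only if it can count a majority of the votes."""
    votes = 1  # the node's own vote
    if hears_peer:
        votes += 1  # heartbeat received, so the peer's vote counts
    elif witness.try_claim(node):
        votes += 1  # no heartbeat, so the witness breaks the tie
    return votes > TOTAL_VOTES // 2


# Case 1: the private network fails while both nodes are still running.
witness = FileShareWitness()
node_a_active = has_quorum("NODE-A", hears_peer=False, witness=witness)
node_b_active = has_quorum("NODE-B", hears_peer=False, witness=witness)

# Case 2: the active node really is down, so only the passive node asks.
witness_2 = FileShareWitness()
passive_takes_over = has_quorum("NODE-B", hears_peer=False, witness=witness_2)
```

In the first case only NODE-A wins the witness, so NODE-B stays passive and split-brain is avoided. In the second case nothing else holds the witness, so the passive node reaches a majority and takes over.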

While asynchronous log shipping (also used in CCR) may involve some data loss, there's another process that can reduce this loss when used in a CCR set. On the Hub Transport server, there's a feature you can configure called the Transport Dumpster. This retains a predetermined amount of recently delivered mail-message data for the cluster.

If the active node goes dead and the passive node jumps in, one of the first orders of business is for the "new" active server to check in with all Hub Transport servers and request any mail data it may not have received. This new active server will double-check all incoming data. It will retain any new messages and discard duplicates. This ensures a greater degree of high availability than LCR.
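The redelivery step can be pictured with a short Python sketch. This is an illustration under assumptions rather than the actual transport pipeline: the TransportDumpster class, the recover_after_failover function and the use of a message ID to spot duplicates are all hypothetical.

```python
from collections import deque


class TransportDumpster:
    """Hypothetical model: retains the most recent messages that passed through a hub."""

    def __init__(self, max_messages):
        self.recent = deque(maxlen=max_messages)  # oldest entries fall off the end

    def record(self, message_id, body):
        self.recent.append((message_id, body))

    def resubmit(self):
        return list(self.recent)


def recover_after_failover(mailbox_db, dumpsters):
    """Ask every hub for its retained mail; keep new items and discard duplicates."""
    added = []
    for dumpster in dumpsters:
        for message_id, body in dumpster.resubmit():
            if message_id not in mailbox_db:
                mailbox_db[message_id] = body
                added.append(message_id)
    return added


hub = TransportDumpster(max_messages=100)
for message_id in ("m1", "m2", "m3", "m4"):
    hub.record(message_id, "body")

# Only m1 and m2 had been replicated to the passive copy before the failure.
new_active_db = {"m1": "body", "m2": "body"}
added = recover_after_failover(new_active_db, [hub])
```

Here the new active server already had m1 and m2, so those are discarded as duplicates, while m3 and m4 are recovered from the hub. Anything older than the dumpster's retention limit would not be recoverable this way.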

One-Stop Shopping

Single Copy Clusters (SCCs) are similar in design to the high availability solution available in Exchange 2003. You have a two-node cluster that relies on a single storage location. This type of solution provides system redundancy, but requires that you provide your own storage redundancy (which could be a NAS or SAN with RAID-level redundancy).

In Exchange 2003, you could configure an active/active mode where both servers were active simultaneously. This solution was so problematic that instead of being updated and enhanced for Exchange 2007 it was discontinued. SCC works only with the active/passive configuration. To evaluate this solution on cost, keep in mind that SCC requires two systems, a RAID-enabled storage solution and the Enterprise Edition of Windows Server 2003.

While LCR, CCR and SCC are the three primary options, the Exchange development team has announced it will release another solution with Exchange 2007 Service Pack 1 later this year.

"With Standby Continuous Replication [SCR], data can be replicated on a per-storage group basis to standby servers or clusters," according to the Exchange development team. "The SCR target, whether a single mailbox server or a cluster, can be placed inside the primary data center or in a remote location, ready to be manually activated if the primary server or data center fails." Stay tuned for more on this development.

Which Way To Go?

Deciding which approach is best for your environment is tough. You need to weigh the cost of high availability against your needs. You may decide a third-party solution is worth the added cost for even higher availability. Whichever method you choose, rest assured that Exchange 2007 has been designed to make any high availability solution easy to execute.