High Availability Data Replication
Author: Phil Rule
What is it?
High Availability Data Replication (known as HDR) is intrinsic functionality within Informix On-Line which enables the replication of one instance to another, to add resilience. Should the master (or primary) instance fail, the replicated (or secondary) instance can be switched to read-write (or standard) mode, either automatically or manually.
The secondary instance can also be used as a read-only instance whilst replication is running, perhaps processing reports to take some load from the primary.
What isn't it?
It isn't Enterprise Replication, which allows databases to be distributed across multiple servers. Enterprise Replication is designed for scenarios such as regional offices having their own servers and instances, with all data replicated to a head office instance - each regional database contains only local data, but the head office database is a read-only (or read-write) composite of all the regional databases. This does not add resilience, nor is it designed to. In version 9.40 it is possible to mix the two, e.g. the primary server can be part of an Enterprise Replication topology; earlier releases do not allow such a setup.
How does it work?
Essentially the secondary server is in permanent logical recovery: it has done a physical recovery from the primary's level 0 archive, and is continually rolling forward logical log activity from the primary instance. Logical log entries are copied to the replication buffer, which is transferred across the network to the secondary. It therefore follows that only logged activity is replicated; any unlogged databases on the primary server will not be replicated to the secondary. Neither will any blobs which are not held within blob spaces (smart or otherwise).
What do you need?
The server
Another server, with the same version of the Informix engine installed. The two servers must, at the very least, share the same manufacturer, o/s version and architecture. One could use a 4-way Sun E450 as the primary and a 2-way E280 as the secondary, but not an E450 primary and a Sparcstation 20 secondary. It's quite common to use a somewhat lower specification server for the secondary; just make sure it can keep up!
The instance
The dbspaces must be similarly configured on the two servers, since the first step in initiating HDR is to restore a Level 0 archive from the primary onto the secondary. However, whilst the same link for each chunk must exist, that link can point to anything, as long as it is large enough to contain the chunk. For instance, the root dbspace on the primary may be a 1Gb raw disk partition pointed to by the link $INFORMIXDIR/links/rootdbs, but the same link on the secondary could point to a cooked file. You do use links for the chunk pathnames, don't you?
Make sure that the TAPE and LTAPE parameters are the same between the two instances - you will be restoring archives and perhaps log tapes written on the primary, and HDR will not start if the block sizes or capacities differ.
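One way to check the tape settings before starting is to compare the two onconfig files directly. The sketch below is only an illustration: the function names are made up, it assumes the usual "NAME value" onconfig layout, and it compares only block sizes and capacities (TAPEBLK, TAPESIZE, LTAPEBLK, LTAPESIZE) on the assumption that the device paths themselves may legitimately differ between hosts.

```shell
# Print the tape block-size and capacity settings from an onconfig file.
# Commented-out lines never match, because their first field starts with '#'.
tape_geometry() {
    awk '$1 ~ /^L?TAPE(BLK|SIZE)$/ { print $1, $2 }' "$1" | sort
}

# Succeed only if both onconfig files agree on those four settings.
same_tape_geometry() {
    [ "$(tape_geometry "$1")" = "$(tape_geometry "$2")" ]
}
```

Copy the secondary's onconfig to the primary (or vice versa) and run something like same_tape_geometry onconfig.primary onconfig.secondary || echo "tape settings differ"; both file names here are placeholders.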
The network
The two servers must be able to see each other on the network, and be able to connect to each other. If there is a firewall in the way, this must be opened up to allow traffic on the tcp ports used by the two instances. Once the restore to the secondary has been done (or there is any sort of instance on the secondary), bring it online, and use dbaccess on the primary to connect across the network to the secondary, and vice versa. Even though you may be prompted for a username/password, if you see the list of databases on the remote server the connectivity is OK.
It's also worth checking the volume of logical logs you get through: it is essentially this which is transferred from primary to secondary, and you will need sufficient bandwidth.
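As a rough guide, the average rate can be worked out from how many logical logs fill per hour and how big each one is. This is a back-of-envelope sketch only: the function name is invented, it gives an average rather than a peak, and it ignores any protocol overhead.

```shell
# Average network rate, in kilobits per second, needed to ship
# <logs_per_hour> logical logs of <log_size_kb> kilobytes each.
# Integer arithmetic, rounded up to the next whole kilobit.
hdr_kbits_per_sec() {
    logs_per_hour=$1
    log_size_kb=$2
    echo $(( (logs_per_hour * log_size_kb * 8 + 3599) / 3600 ))
}
```

For example, hdr_kbits_per_sec 12 10000 (twelve 10,000 Kbyte logs an hour) prints 267. Size the link for the busiest hour, not the daily average.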
Initiating Replication
Using ontape
Even if the normal backup strategy is onbar (we're assuming here that no-one still uses Onarchive!) it is far easier to do this with ontape. See below for additional info on using onbar for this.
On the primary run onstat -g dri. This will tell us the current HDR status. You should see something like this:
Informix Dynamic Server Version 9.30.UC1 -- On-Line -- Up 3 days 22:37:14 -- 159744 Kbytes

Data Replication:
  Type         State        Paired server        Last DR CKPT (id/pg)
  standard     off                               -1 / -1

  DRINTERVAL   30
  DRTIMEOUT    30
  DRLOSTFOUND  /export/home/informix/etc/dr.lostfound
If Last DR CKPT (id/pg) shows anything other than -1 / -1, replication is unlikely to start: run onmode -d standard and restart the instance. If you don't do this you are likely to get all sorts of errors.
Start with the level 0 restore on the secondary. It doesn't have to be the latest; the main criteria are that the configuration on the secondary will take the restore (same chunk links, compatible onconfig file etc.) and that the secondary is able to access all logical logs backed up from the primary since the level 0 was taken. The command to use is ontape -p, since we do not want to bring the instance online. Don't ask it to back up the logs, and don't restore a level 1 tape - replication can only start from a Level 0.
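The Last DR CKPT check can be scripted if you'd rather not eyeball it. The sketch below assumes the onstat -g dri layout matches the sample shown above (a header line containing "Last DR CKPT", with the values on the next non-blank line); the function name is invented, and other engine versions may lay the output out differently.

```shell
# Read "onstat -g dri" output on stdin. Succeed only if the row under the
# "Last DR CKPT (id/pg)" header ends in "-1 / -1".
dr_ckpt_clean() {
    awk '
        seen && NF {                      # first non-blank line after the header
            found = 1
            ok = (NF >= 3 && $(NF-2) == "-1" && $(NF-1) == "/" && $NF == "-1")
            exit
        }
        /Last DR CKPT/ { seen = 1 }
        END { exit ((found && ok) ? 0 : 1) }
    '
}
```

Typical use would be onstat -g dri | dr_ckpt_clean || echo "run onmode -d standard and restart the instance first".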
Once you get the "Program over" message you can start replication (you can do this earlier, but you'll just fill the log files with errors). Run onmode -d primary secondary_server_name on the primary, and onmode -d secondary primary_server_name on the secondary. Check the online log on the secondary. If there is a message DR: Start failure recovery from tape … you need to recover logs from tape: the primary is unable to start sending logs completed since the level 0 because the one it needs to start with has already been backed up. On the secondary's log tape device mount the tape containing that log, and run ontape -l; this will start the log recovery. This is why it is essential that the tape parameters match between the servers. If more than one log archive has been done since the level 0, keep loading the tapes and continuing the restore.
If no log tapes need restoring, or after all that are required have been restored, you will see DR: Failure recovery from disk in progress in the secondary online log. Now the primary is sending all non-archived log files to the secondary, which is in turn rolling them forward. In due course each server will confirm that DR is up and running, i.e. DR: Primary server operational and DR: Secondary server operational will appear in the respective log files. The banner from onstat commands will show the status On-Line (Prim) or On-Line Read-Only (Sec). The secondary server is now accessible exactly as though it were a normal server, except that it is read-only. For normal operation no further action need be taken on either server.
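For reference, here is the start-up sequence collected into one place as small shell functions. This is a sketch, not a turnkey script: the function names and the RUN dry-run switch are inventions for illustration, ontape is interactive and will still prompt for tapes, and each function has to be run on the server named in its comment.

```shell
# Dry run by default: each step is printed, not executed.
# Set RUN to an empty string (RUN=) to run the commands for real.
RUN=${RUN-echo}

# On the secondary: physical restore of the primary's level 0, instance left offline.
restore_level0() { $RUN ontape -p; }

# On the primary: pair with the named secondary server.
make_primary()   { $RUN onmode -d primary "$1"; }

# On the secondary: pair with the named primary server.
make_secondary() { $RUN onmode -d secondary "$1"; }

# On the secondary, only if the online log asks for recovery from tape.
recover_logs()   { $RUN ontape -l; }

# On either server: show the current replication state.
hdr_status()     { $RUN onstat -g dri; }
```

With hypothetical server names pri_srv and sec_srv, the order is restore_level0 on the secondary, make_primary sec_srv on the primary, make_secondary pri_srv on the secondary, then recover_logs only if the log asks for it.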
Restarting Replication
It depends entirely on why it failed in the first place. One thing common to almost all scenarios is to make sure the secondary doesn't come up as standard; if it does, you will almost certainly have to restart from a Level 0.
Loss of network connection
Later versions of the engine will try to reconnect every few seconds - this can cause the log file to become quite large. But it does mean that as soon as the network is back replication will resume.
HDR is a bit sensitive over VPNs, and has been known to reset the odd firewall parameter, blocking its own traffic.
Other reasons
Assuming the logical logs on the primary haven't wrapped round, i.e. the last one rolled forward by the secondary has not yet been overwritten, there's a good chance HDR will resume. Shut down the secondary, and restart it in quiescent mode. Switch the primary back to standard, then primary again. You may have to try a combination of these.
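The two moves described above look roughly like this as commands. Again this is a sketch: the function names and the RUN dry-run switch are made up for illustration, sec_srv stands in for your secondary's server name, and which combination actually clears the fault will vary.

```shell
# Dry run by default: each step is printed, not executed.
# Set RUN to an empty string (RUN=) to run the commands for real.
RUN=${RUN-echo}

# On the secondary: take the instance offline, then bring it up in quiescent mode.
restart_secondary_quiescent() {
    $RUN onmode -ky
    $RUN oninit -s
}

# On the primary: drop back to standard, then declare it primary again.
reset_primary() {
    $RUN onmode -d standard
    $RUN onmode -d primary "$1"
}
```

For example, reset_primary sec_srv prints (or, with RUN empty, runs) the two onmode commands in order.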
Common 'gotchas'
You cannot change the logging type of an unlogged database whilst replication is in progress. ondblog will just ignore the command, but if you use ontape you'll get an error. You have to break replication on the primary, set the database logging as required and then do a Level 0 archive. Restore this archive on the secondary, and re-initialise replication. Basically Online is not prepared to copy an entire database via HDR.
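The primary-side part of that procedure might look like the following. It is a sketch under assumptions: the function name, the RUN dry-run switch and the database name mydb are placeholders, and -U (unbuffered logging) is just one choice; ontape also accepts -B for buffered logging.

```shell
# Dry run by default: each step is printed, not executed.
# Set RUN to an empty string (RUN=) to run the commands for real.
RUN=${RUN-echo}

# On the primary: break replication, then take a level 0 archive that also
# switches the named database to unbuffered logging.
enable_logging_with_level0() {
    $RUN onmode -d standard
    $RUN ontape -s -L 0 -U "$1"
}
```

After enable_logging_with_level0 mydb, restore that archive on the secondary with ontape -p and re-initialise replication as for a first-time setup.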
Adding chunks/spaces
This is a common one! Make sure that the raw devices and links are set up on the secondary as well as the primary, and run the onspaces (or onmonitor) command on the primary only - the command is then replicated over the network to the secondary, and the space/chunk is added automatically.
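As a sketch of the order of events when adding a chunk to an existing dbspace: the function names, the RUN dry-run switch, and the example dbspace, link and device names are all placeholders, and the offset of 0 is an assumption that will not suit every disk layout.

```shell
# Dry run by default: each step is printed, not executed.
# Set RUN to an empty string (RUN=) to run the commands for real.
RUN=${RUN-echo}

# On BOTH servers: create the link the new chunk will be known by.
# Usage: link_chunk <device-or-file> <link-name>
link_chunk() {
    $RUN ln -s "$1" "$INFORMIXDIR/links/$2"
}

# On the primary ONLY: add the chunk to an existing dbspace (size in Kbytes).
# Usage: add_chunk <dbspace> <link-name> <size-kb>
add_chunk() {
    $RUN onspaces -a "$1" -p "$INFORMIXDIR/links/$2" -o 0 -s "$3"
}
```

So link_chunk /dev/rdsk/c1t2d0s4 data2 goes on each server (with whatever device suits that server), and add_chunk datadbs data2 1000000 on the primary alone.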