Node Failure Recovery
Node failures may be related to hardware, software, or the operating system.
This node failure recovery process applies only to Teradata Database MPP instances. For Teradata Database single-node systems, recovery of a failed node is automatic in the AWS environment. Teradata spins up a replacement node, and the instance automatically restarts and is configured on another physical server. The PDE is in a NODESTART state while the node is recovering. After the system is back up, logons are enabled. If a node hangs on a single-node system, it does not recover automatically; you must stop the instance and start it again from the AWS Web Console.
For Teradata Database MPP instances, the replacement node is based either on an image recorded when the system was first launched or on an updated image created after a software upgrade. The replacement node has the same secondary IP, elastic IP, and identifiers as the replaced node, but its primary IP address is new. If you allocated an elastic IP address to each node when you launched your instance, the public IP address should be the same after the node recovery. The secondary private IP addresses are the same regardless of the elastic IP address settings you chose when you launched the instance. At least one free IP address in the subnet is required.
You must create a new restore image after a software upgrade. If you do not create this image and a node failure occurs, the software on the recovered instance will not match and the database cannot start. For information, see Creating a Restore Image.
Node failure recovery is handled differently than it is on on-premises systems. Unless you want failed nodes to continue running for diagnostic purposes, you should terminate the instance. For information, see Stopping or Terminating Teradata Database Instances. If AWS Support cannot diagnose the issue, contact Teradata Customer Support.
There is a cost associated with keeping a failed node running. You have the option of stopping or terminating the instance after the new node is brought online. For more information, see Amazon User Guide for Linux Instances. You can also configure, before a node failure occurs, whether the instance stops or terminates. For more information, see Configuring the Instance State for Node Failure Recovery.
Node failure recovery takes longer than a typical reset. A node failure can have many possible causes, and it may be difficult for you to determine which one applies. If your instance does not automatically restart within 10 minutes, first check the alerts in Teradata Server Management or Teradata Viewpoint. For additional assistance, contact Teradata Customer Support.
When using EBS Storage
When a node fails, a virtual hot standby node instance spins up automatically. The EBS storage is detached from the failed node, the new instance is configured, the EBS storage is reattached to the new instance, and the configuration is reinstated.
If a node fails and another instance cannot be provisioned in the same placement group, or if there are issues with the data in the EBS volume, the system continues to run on Fallback. Contact Teradata Customer Support to spin up a new instance in the placement group and start Fallback recovery.
If more than one node fails at the same time, only one node is recovered automatically; however, the system still runs on Fallback. If more than two nodes fail at the same time, the system can run on Fallback only if the two remaining failed nodes have AMPs that are not in the same fallback cluster. If the two failed nodes have AMPs in the same fallback cluster, you can recover the system by spinning up replacement instances and reattaching the EBS storage.
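The fallback cluster condition can be illustrated with a minimal Python sketch. It is not a Teradata utility. The function name can_run_on_fallback, the node names, and the cluster numbers are all hypothetical, and the sketch assumes you already know which fallback clusters the AMPs on each node belong to.

```python
from itertools import combinations


def can_run_on_fallback(failed_nodes, node_clusters):
    """Return True if no two failed nodes have AMPs in the same fallback cluster.

    failed_nodes:  iterable of node names that are currently down.
    node_clusters: dict mapping each node name to the set of fallback
                   cluster numbers that its AMPs belong to (hypothetical
                   layout supplied by the caller).
    """
    for node_a, node_b in combinations(sorted(failed_nodes), 2):
        # A shared cluster means two AMPs of that cluster are down at once.
        if node_clusters[node_a] & node_clusters[node_b]:
            return False
    return True


# Hypothetical three-node layout: node1 and node2 share fallback cluster 0.
node_clusters = {
    "node1": {0, 1},
    "node2": {0, 2},
    "node3": {3, 4},
}

print(can_run_on_fallback({"node1", "node3"}, node_clusters))  # True
print(can_run_on_fallback({"node1", "node2"}, node_clusters))  # False
```

In the second call the two failed nodes share a fallback cluster, which corresponds to the case where replacement instances must be spun up and the EBS storage reattached.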
When using Local Storage
Node failures are handled differently on local storage instance types. When a node fails, the data is lost. Although the node is replaced and comes back online, the AMPs on the recovered instance display as FATAL and offline. The other vprocs on the system are online and in the configuration. To fully restore a local storage instance, you must rebuild the AMPs. Until the AMPs are rebuilt, data is maintained in fallback rows on the other AMPs. For more information, see Rebuilding AMPs after Failure.
Before a Node Failure Occurs
Before a node failure occurs, set up alerts using the Server Management portlet. For example, create and enable an alert named TD Database Restart with an SM_LOG alert action set so that you are notified when the database restarts after a node failure. When an alert occurs, you can view it in the Alert Viewer portlet in Teradata Viewpoint. For information on setting up email alerts, see the Teradata Server Management Web Services User Guide.
Ensure there is at least one free IP address in the subnet before launching an instance.
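One way to check the free IP address requirement is to inspect the AvailableIpAddressCount field that the Amazon EC2 DescribeSubnets API returns for each subnet. The following Python sketch assumes you have already retrieved that response as a dictionary. The helper name subnet_has_free_ip, the subnet ID, and the sample response are hypothetical.

```python
def subnet_has_free_ip(describe_subnets_response, subnet_id, minimum=1):
    """Return True if the subnet reports at least `minimum` unused IP addresses.

    describe_subnets_response: dict shaped like the EC2 DescribeSubnets
    response, for example the value returned by
    boto3.client("ec2").describe_subnets(SubnetIds=[subnet_id]).
    """
    for subnet in describe_subnets_response.get("Subnets", []):
        if subnet.get("SubnetId") == subnet_id:
            return subnet.get("AvailableIpAddressCount", 0) >= minimum
    raise KeyError(f"subnet {subnet_id!r} not found in response")


# Hypothetical, trimmed-down response for illustration only.
sample_response = {
    "Subnets": [
        {"SubnetId": "subnet-0example", "AvailableIpAddressCount": 3},
        {"SubnetId": "subnet-0full", "AvailableIpAddressCount": 0},
    ]
}

print(subnet_has_free_ip(sample_response, "subnet-0example"))  # True
print(subnet_has_free_ip(sample_response, "subnet-0full"))     # False
```

The check works on a plain dictionary so that it can be run without AWS credentials. Because the replacement node needs a free address at the time of recovery, it may be worth repeating the check periodically and not only before launch.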