Node Failure Recovery
Node failures may be related to hardware, software, or the operating system.
This node failure recovery process applies only to Teradata Database MPP instances. For Teradata Database single-node systems, recovering failed nodes is automatic in the AWS environment. Teradata spins up a replacement node and the instance automatically restarts and is configured on another physical server. The PDE will be in a NODESTART state while the node is recovering. After the system is back up, you will see that logons are enabled. If a node hangs on a single-node system, it does not recover automatically. The instance must be stopped and started again from the AWS Web Console.
For Teradata Database MPP instances, the replacement node is based on either an image recorded when the system is first deployed or an updated image created after a software upgrade. The replacement node has the same secondary IP, elastic IP, and identifiers as the replaced node, but its primary IP address is new. If you allocated an elastic IP address to each node when you deployed your instance, the public IP address should be the same after node recovery. The secondary private IP addresses are the same regardless of the elastic IP address settings you chose when you deployed the instance. At least one free IP address in the subnet is required.
You must create a new system image after a software upgrade. If you do not create this image and a node failure occurs, the software on the replacement node will not match the rest of the system and the database cannot start.
Node failure recovery is handled differently than on on-premises systems. Unless you want a failed node to continue running for diagnostic purposes, you should terminate the instance. If AWS Support cannot diagnose the issue, contact Teradata Customer Support.
There is a cost associated with keeping a downed node running. You have the option of stopping or terminating the instance after the new node is brought online. See the Amazon EC2 User Guide for Linux Instances. You can also configure the instance to stop or terminate before a node failure.
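As an illustration only, stopping or terminating a downed instance can be scripted instead of done from the AWS Web Console. The sketch below assumes the AWS SDK for Python (boto3) and uses its `stop_instances` and `terminate_instances` calls. The helper name `retire_downed_node` and the instance ID in the usage comment are hypothetical and not part of any Teradata tooling.

```python
def retire_downed_node(ec2_client, instance_id, action="stop"):
    """Stop or terminate a downed node instance once its replacement is online.

    ec2_client is assumed to behave like boto3.client("ec2").
    """
    if action == "stop":
        # Keeps the instance available for later diagnostics.
        return ec2_client.stop_instances(InstanceIds=[instance_id])
    if action == "terminate":
        # Permanently removes the instance.
        return ec2_client.terminate_instances(InstanceIds=[instance_id])
    raise ValueError(f"action must be 'stop' or 'terminate', got {action!r}")


# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# retire_downed_node(boto3.client("ec2"), "i-0123456789abcdef0", action="terminate")
```

Passing the client in as a parameter keeps the helper easy to test with a stand-in object before it is run against a real account.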
Node failure recovery takes longer than a typical reset. There are dozens of possible reasons for a node failure, and it may be difficult for you to determine the cause. If your instance does not automatically restart within 10 minutes, first check the alerts in Teradata Server Management or Teradata Viewpoint. For assistance, contact Teradata Customer Support.
Before a Node Failure Occurs
Before deploying an instance, ensure there is at least one free IP address in the subnet. Also, set up alerts using the Server Management portlet. For example, create and enable an alert named TD Database Restart with an SM_LOG alert action set so that you are notified when the database restarts after a node failure. When an alert occurs, you can view it in the Alert Viewer portlet in Teradata Viewpoint. See the Teradata® Server Management Web Services User Guide.
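The free IP address requirement can be checked programmatically. The following is a minimal sketch, assuming the AWS SDK for Python (boto3), whose `describe_subnets` response reports an `AvailableIpAddressCount` for each subnet. The helper name `subnet_has_free_ip` and the subnet ID in the usage comment are hypothetical.

```python
def subnet_has_free_ip(describe_subnets_response, minimum=1):
    """Return True if every subnet in the response has at least `minimum` free addresses.

    The argument is assumed to be the dict returned by the EC2 describe_subnets call.
    """
    subnets = describe_subnets_response["Subnets"]
    if not subnets:
        raise ValueError("response contains no subnets")
    return all(s["AvailableIpAddressCount"] >= minimum for s in subnets)


# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# resp = boto3.client("ec2").describe_subnets(SubnetIds=["subnet-0123456789abcdef0"])
# print(subnet_has_free_ip(resp))
```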
When Multiple Node Failures Occur
When two or more nodes fail at the same time, all of them can be replaced at the same time as long as one node remains running to act as the node failure recovery control node. However, if both BYNET relay nodes in the database are unavailable during node failure recovery, the database stops. If one or more replacement instances cannot be spun up, the database is brought back up running on Fallback and logons are enabled. If two failed nodes host AMPs in the same fallback cluster, logons are enabled, but queries that reference data in that fallback cluster fail because both AMPs are inaccessible. Queries that do not reference data from the failed fallback clusters still work.
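The fallback cluster condition described above can be illustrated with a small model. This is a simplified sketch, not how the database itself tracks availability: it assumes a fallback cluster becomes inaccessible once two or more of its AMPs are on failed nodes. The function name and the two mapping arguments are hypothetical.

```python
from collections import Counter


def inaccessible_clusters(amp_to_cluster, amp_to_node, failed_nodes):
    """Return the set of fallback clusters with two or more AMPs on failed nodes.

    amp_to_cluster: maps each AMP id to its fallback cluster id (hypothetical input).
    amp_to_node:    maps each AMP id to the node that hosts it (hypothetical input).
    """
    failed = set(failed_nodes)
    # Count how many AMPs in each cluster live on a failed node.
    down_per_cluster = Counter(
        amp_to_cluster[amp] for amp, node in amp_to_node.items() if node in failed
    )
    return {cluster for cluster, down in down_per_cluster.items() if down >= 2}


# Example layout: cluster 0 holds AMPs 0 and 1, cluster 1 holds AMPs 2 and 3.
amp_to_cluster = {0: 0, 1: 0, 2: 1, 3: 1}
amp_to_node = {0: "node-a", 1: "node-b", 2: "node-a", 3: "node-c"}

# Losing node-a and node-b takes down both AMPs of cluster 0 but only one AMP of cluster 1.
print(inaccessible_clusters(amp_to_cluster, amp_to_node, ["node-a", "node-b"]))  # prints {0}
```

In this model, queries touching only cluster 1 would still be answerable after that double failure, which matches the behavior described for queries that do not reference the failed fallback clusters.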
If two or more nodes fail at different times while node failure recovery is in progress, contact Teradata Customer Support.
Replacing a Node When a Teradata Database System is Running on Fallback
In Teradata Software for AWS 5.01 and later, when the node failure recovery process fails to replace the downed node, the Teradata Database system keeps running on Fallback. To replace the downed node, log in to https://access.teradata.com and search for KCS009816.
When Using EBS Storage
When a node fails, a virtual hot standby node instance automatically spins up. The recovery process detaches the EBS storage from the failed node, configures the new instance, reattaches the EBS storage to the new instance, and reinstates the configuration.
If a node fails and another instance cannot be provisioned in the same placement group, or if there are issues with the data in the EBS volume, the system continues to run on Fallback. Contact Teradata Customer Support to spin up a new instance in the placement group and start Fallback recovery.
When Using Local Storage
Node failures are handled differently on local storage instance types. When a node fails, the data is lost. Although the node is replaced and comes back online, the AMPs on the recovered instance display as FATAL and offline. The other vprocs on the system are online and in the configuration. To fully restore a local storage instance, you must rebuild the AMPs. Until the AMPs are rebuilt, data is maintained in fallback rows on the other AMPs.