Details about HA cluster mechanics, mitigating problems with both cluster nodes active or inactive

Peter
Posts: 611
Joined: 10 Apr 2008, 14:14
Location: Clavister HQ - Örnsköldsvik


Post by Peter » 23 Dec 2016, 13:16

This How-to applies to:
  • Clavister Security Gateway 10.x and up.
Description:

When using a Clavister High Availability cluster there may be situations where the cluster is not behaving correctly. There are primarily two situations that are the most common:

1. Both Master and Slave nodes are active at the same time.
2. Both Master and Slave nodes are inactive at the same time.

Note: The problem symptom could also be that the cluster nodes are changing roles very frequently.

This How-To will discuss both scenarios and possible ways to troubleshoot and solve them.

Note: This How-To assumes that the Active/Active configuration scenario described in the admin guide is NOT used.

Discussion:
  • Scenario-1: Both cluster nodes active at the same time.
This is the most common problem we have observed and it basically means that the Master node cannot find the Slave node and the Slave node cannot find the Master node. In order to explain this further we have an example cluster that currently has 5 interfaces, as shown in the following picture.
[Image: Cluster.png – example cluster setup with five interfaces]
This is a standard setup of a cluster. Every physical interface is connected to a switch; for instance, G1 on both the Master and the Slave is connected to the same switch, and the same applies to all the other interfaces. The exception is the sync interface (G5), which is connected directly between the Master and Slave units using a standard TP cable, with nothing in between.

In order for the Master to find the Slave unit and vice versa, the cluster sends out heartbeats on all PHYSICAL interfaces. Heartbeats are basically a method for the cluster nodes to declare their peer as up or down. We can think of them as a kind of ping packet constantly being sent and received that tells the cluster peer that the sender is up.

Important: Heartbeats are NOT sent out on non-physical interfaces such as VLANs, IPsec tunnels, GRE tunnels, PPTP connections, Loopback interfaces etc. If VLANs are used on a switch, untagged VLAN0 must be configured on the switch to allow heartbeat packets to be sent between the Firewalls.

Each interface sends heartbeats at a set interval, and each cluster node then decides whether its peer should be declared up or down based on this data.
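
To illustrate the principle (this is purely a sketch, not actual cOS Core logic, and the timeout value is an assumption), a node could declare its peer alive as long as at least one interface has seen a recent heartbeat:

Code:
  import time

  # Hypothetical illustration only -- not actual cOS Core code or values.
  HEARTBEAT_TIMEOUT = 1.0   # assumed time in seconds before an interface is considered silent

  last_heartbeat = {}       # interface name -> timestamp of the last heartbeat received from the peer

  def record_heartbeat(interface):
      """Called whenever a heartbeat from the cluster peer arrives on an interface."""
      last_heartbeat[interface] = time.time()

  def peer_is_alive():
      """The peer is considered alive if at least one interface has seen a recent heartbeat."""
      now = time.time()
      return any(now - ts < HEARTBEAT_TIMEOUT for ts in last_heartbeat.values())

  record_heartbeat("G1")
  print("Peer alive:", peer_is_alive())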

At the same time, this means we have some redundancy. In our example we have 5 interfaces in total; what if two of them are not used and do not even have link?
[Image: Cluster_NoLink.png – the same cluster with two interfaces lacking link]
Would the cluster go active/active in this scenario? No, it would not. The reason is that the cluster peers still send/receive heartbeats on the other interfaces and can therefore still declare the cluster peer as alive.

Question: What if all interfaces except the sync are working? Would they go active/active then?
Answer: No, they would not. The synchronization interface normally sends heartbeats at more than twice the rate of a normal interface, but the heartbeats from the remaining interfaces would still be more than enough to declare the cluster peer as alive. There would be no state synchronization though, nor any configuration synchronization unless InControl is used.

Question: What about the reverse? If all interfaces except sync are down, would that cause an active/active situation?
Answer: No, it would not. As long as even one interface is working and can send/receive heartbeats, it is enough for the cluster to see its peer and to know which node should be active and which should be inactive. Having only one interface able to send/receive heartbeats, however, makes the cluster very sensitive, and the slightest network hiccup may cause the cluster to fail over and/or start going into an active/active state. It is recommended to have heartbeats enabled and working on as many interfaces as possible; the more the better.

Question: If the Sync interface is down, how can the cluster determine which node should be the active one?
Answer: The heartbeat packets sent out on all interfaces also contain information about the number of connections the sending node has. The cluster nodes can then determine which one should be the active node by comparing the connection count between themselves and their peer. The one with the most connections will be the active node.
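
As a simplified illustration of that tie-break (hypothetical function and values, not actual cOS Core code; the behaviour on an exact tie is an assumption made only for the sketch):

Code:
  # Hypothetical sketch of the connection-count comparison described above.
  def decide_active(own_connections, peer_connections, own_is_master):
      """Return True if this node should be the active one."""
      if own_connections != peer_connections:
          return own_connections > peer_connections
      # Assumption for the sketch: on a tie, let the Master win so exactly one node goes active.
      return own_is_master

  print(decide_active(own_connections=1500, peer_connections=900, own_is_master=False))  # True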

We mentioned earlier that the synchronization interface sends more heartbeats than a normal interface; the principle is (simplified) the following sequence:
  • 1. Heartbeat sent on G1
    2. Heartbeat sent on Sync
    3. Heartbeat sent on G2
    4. Heartbeat sent on Sync
    5. Heartbeat sent on G3
    6. Heartbeat sent on Sync
And so on. Even if only the Sync interface is able to send/receive heartbeats, it sends them at a higher rate, which makes the cluster less likely to enter an active/active situation than if a normal non-sync interface had been the only one used for heartbeats.
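
Expressed as a small sketch, the interleaving could be pictured like this (purely illustrative; the real scheduling in cOS Core is more involved, and the interface names are just the ones from the example above):

Code:
  # Illustrative only: interleave the Sync interface between every normal interface,
  # so Sync ends up sending roughly twice as many heartbeats as any single interface.
  normal_interfaces = ["G1", "G2", "G3", "G4"]

  def heartbeat_sequence(interfaces, rounds=1):
      sequence = []
      for _ in range(rounds):
          for iface in interfaces:
              sequence.append(iface)
              sequence.append("Sync")
      return sequence

  print(heartbeat_sequence(normal_interfaces))
  # ['G1', 'Sync', 'G2', 'Sync', 'G3', 'Sync', 'G4', 'Sync']
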
  • Heartbeat characteristics:
    • The source IP is the interface address of the sending security gateway.
    • The destination IP is the broadcast address on the sending interface.
    • The IP TTL is always 255. If cOS Core receives a cluster heartbeat with any other TTL, it is assumed that the packet has traversed a router and therefore cannot be trusted.
    • It is a UDP packet, sent from port 999, to port 999.
    • The destination MAC address is the Ethernet multicast address corresponding to the shared hardware address and has the form 11-00-00-00-NN-MM, where NN is a bit mask made up of the interface bus, slot and port on the Master and MM represents the cluster ID.
    • Link layer multicasts are used instead of normal unicast packets for security. Using unicast packets would mean that a local attacker could fool switches into routing the heartbeats somewhere else, so the inactive system never receives them.
Note: Due to these characteristics, heartbeats are designed NOT to traverse any routers, which is one of the primary reasons why they cannot be sent over e.g. VPN tunnels or VLANs.
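
Putting those characteristics together, a hedged sketch of a receiver-side sanity check could look like the following. The exact encoding of the NN byte (the bus/slot/port bit mask) is not reproduced here, so the helper below is a placeholder assumption:

Code:
  # Illustrative sketch based on the heartbeat characteristics listed above.
  # The NN byte encoding (bus/slot/port bit mask) is a placeholder assumption.
  HEARTBEAT_PORT = 999

  def shared_multicast_mac(nn_bitmask, cluster_id):
      """Build a multicast MAC of the 11-00-00-00-NN-MM form described above."""
      return "11-00-00-00-{:02X}-{:02X}".format(nn_bitmask & 0xFF, cluster_id & 0xFF)

  def looks_like_valid_heartbeat(ip_ttl, udp_src, udp_dst, dst_mac, expected_mac):
      """Reject anything that does not match the documented heartbeat characteristics."""
      if ip_ttl != 255:          # a lower TTL means the packet passed a router -> untrusted
          return False
      if udp_src != HEARTBEAT_PORT or udp_dst != HEARTBEAT_PORT:
          return False
      return dst_mac.upper() == expected_mac.upper()

  mac = shared_multicast_mac(nn_bitmask=0x05, cluster_id=1)
  print(mac)  # 11-00-00-00-05-01
  print(looks_like_valid_heartbeat(255, 999, 999, mac, mac))  # True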

Solution:

The main problem when encountering an active/active situation is, as discussed above, the lack of heartbeats. But there may be situations where this is due to how the network is designed. For instance, all interfaces may be going through equipment that actively scans the traffic, and since heartbeats are specifically designed NOT to pass through any "hop" such as a router or similar, they will most likely be dropped by said equipment because it finds the packets suspicious.

The main problem in an active/active situation is that the HA cluster nodes are not receiving enough heartbeats from their cluster peer. If both nodes go active at the same time, it means that not even one interface is able to both send and receive heartbeats.

There are four items that can be used/investigated to try to limit the chance of active/active situations.
  • Solution-1: Disable Heartbeats
The first solution to this problem is to disable the sending of heartbeats on interfaces that are not in use, or on interfaces that we know cannot send/receive heartbeats to/from their cluster peer.

The option to disable cluster heartbeats can be found under the advanced tab on each physical interface.
[Image: DisableHeartbeats.png – the heartbeat option on the Advanced tab of a physical interface]
It is recommended to make a note in the comment field on interfaces where you disable heartbeats, for future reference. It is also recommended that if/when such an interface is taken into use, the sending of heartbeats is activated again.

This is an optional setting. Since even one interface is enough to send/receive heartbeats, this is primarily a way to stop the cluster from generating heartbeat packets on interfaces/networks that are not actively in use.

  • Solution-2: Disable interfaces that are not in use.

The second solution is to disable the interface itself. Simply right-click the interface and select "Disable interface". This stops the sending of heartbeats, as cOS Core will basically be unaware that the interface exists and will not take it into consideration regarding heartbeats.

WARNING! Before you disable any interface, make sure it is NOT the registration interface. If you disable the interface to which the license is bound, the Firewall will enter Lockdown mode! The easiest way to check which interface the license is bound to is to use the CLI command "license" and then combine that with the CLI command "ifstat -maclist" to see a list of all interfaces and their MAC addresses.
  • Solution-3: Identical hardware settings on Virtual Security Gateways
In an HA cluster it is very important that the Master and the Slave have the same hardware settings (PCI bus, slot & port). The reason is that the shared MAC address is calculated from the hardware settings plus the cluster ID. If the hardware settings on the cluster nodes differ, the shared MAC addresses differ as well, causing the cluster peers to think that the heartbeat and synchronization packets are not from their cluster peer.

When using a Virtual Security Gateway (VSG) it is very important that both cluster nodes are created using the same virtual hardware settings. If the hardware settings are not the same, there is a very high chance of an active/active scenario. The logs would most likely also be filled with events about "disallowed_on_sync_iface".

The easiest way to compare the hardware settings between the cluster nodes is to download a Technical Support File (TSF) from e.g. the WebUI from both cluster nodes and then compare their hardware sections.
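
If you prefer to script the comparison, a rough sketch along these lines could help. It assumes the TSFs have been saved as plain text files and that the hardware information can be recognised by lines containing bus/slot/port/MAC keywords; the file names and keywords are assumptions, not a documented TSF format:

Code:
  # Hypothetical helper: extract lines that look like hardware/bus/slot/port information
  # from two Technical Support Files saved as text, and show the differences.
  import difflib

  def hardware_lines(path, keywords=("bus", "slot", "port", "mac")):
      with open(path, encoding="utf-8", errors="ignore") as f:
          return [line.rstrip() for line in f if any(k in line.lower() for k in keywords)]

  master = hardware_lines("tsf_master.txt")   # assumed file name
  slave = hardware_lines("tsf_slave.txt")     # assumed file name

  for line in difflib.unified_diff(master, slave, "master", "slave", lineterm=""):
      print(line)
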
  • Solution-4: Investigate high Diffie-Hellman (DH) group usage on IPsec tunnels
This is a problem we have started seeing more and more of since the introduction of DH groups 14-18 in version 10.20.00. The problem with using DH group 14 and above is that the CPU power required to generate these very strong keys is considerable and can cause system stalls on hardware that is not powerful enough. DH group 18, for instance, uses an 8192-bit key and requires very powerful hardware in order to avoid system interruptions.

The effect of using too high DH groups can be that the inactive node believes its peer is gone and goes active. To troubleshoot, examine the logs prior to an active/active event to see if there was an IPsec tunnel negotiation just before the event, or perform a configuration review and check whether DH group 14 or above is used on any IPsec tunnel. A tunnel with many local and remote networks will also generate many DH key negotiations, and having many DH negotiations going on at the same time makes the situation even worse.
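
To get a feel for why the higher groups are so much heavier, the following standalone sketch times a single modular exponentiation with a 2048-bit versus an 8192-bit modulus. Random numbers are used instead of the real DH group parameters, so this only illustrates the relative cost, not the actual IKE negotiation:

Code:
  # Rough illustration of how modular exponentiation cost grows with key size.
  # Random moduli are used as stand-ins for the real DH group primes.
  import secrets, time

  def time_modexp(bits):
      modulus = secrets.randbits(bits) | (1 << (bits - 1)) | 1   # stand-in for the group prime
      exponent = secrets.randbits(bits)                          # stand-in for the private value
      start = time.perf_counter()
      pow(2, exponent, modulus)                                  # the expensive operation in a DH exchange
      return time.perf_counter() - start

  for bits in (2048, 8192):                                      # roughly DH group 14 vs group 18 sizes
      print(bits, "bit:", round(time_modexp(bits), 3), "seconds")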

  • Final note:

It is important to highlight again that the main cause of an active/active situation is the lack of heartbeats. As long as heartbeats are unable to traverse from the Master to the Slave and vice versa, there is a potential source of problems. This should be investigated first, since even one interface should be enough for the cluster nodes to see their cluster peer.
  • Scenario-2: Both cluster nodes inactive at the same time.
This scenario is very unusual/rare. It means that both cluster nodes have received heartbeat data from their cluster peer indicating that the peer has more connections and should be the active node. The problem is that both cluster nodes believe the other peer has more connections, so both stay inactive.

Solution:

The solution to this problem is not as obvious, as very strange network problems can cause such an issue, but here are a few tips on what to check:

1. In case there is more than one Clavister cluster configured in the network and they can see each other, verify that they are NOT using the same cluster ID. If the same cluster ID is used, there is a chance that a cluster will interpret heartbeats from the other cluster's peers as its own.
1.1. A common log entry during this kind of problem is "heartbeat_from_myself".
2. Check the network in case there is some sort of port mirroring or other equipment that mirrors the packets sent by the cluster, so that the cluster sees its own heartbeats.
3. Verify that the cluster is correctly configured, for instance that both nodes are not configured as Master or both as Slave node type.
4. Make sure that the Shared, Master_IP and Slave_IP addresses on an interface are not the same. If, for instance, all three use the same IP address, you would encounter very strange problems and the logs may contain "heartbeat_from_myself" entries as well. A small sketch of some of these checks is shown below.
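
The sketch below expresses checks 1, 3 and 4 in code form. The field names and addresses are hypothetical, and the heartbeat_from_myself detection is simplified to a plain source-address comparison:

Code:
  # Hypothetical sanity checks for the inactive/inactive scenario described above.
  def check_cluster_config(node_a, node_b, other_cluster_ids=()):
      problems = []
      if node_a["cluster_id"] in other_cluster_ids:
          problems.append("Cluster ID is shared with another cluster on the same network")
      if node_a["node_type"] == node_b["node_type"]:
          problems.append("Both nodes are configured as %s" % node_a["node_type"])
      for node in (node_a, node_b):
          ips = {node["shared_ip"], node["master_ip"], node["slave_ip"]}
          if len(ips) < 3:
              problems.append("Shared, Master_IP and Slave_IP overlap on one node")
      return problems

  def heartbeat_from_myself(own_ips, heartbeat_source_ip):
      """Simplified: a heartbeat whose source address is our own should never be accepted."""
      return heartbeat_source_ip in own_ips

  node_a = {"cluster_id": 1, "node_type": "Master",
            "shared_ip": "192.0.2.1", "master_ip": "192.0.2.2", "slave_ip": "192.0.2.3"}
  node_b = {"cluster_id": 1, "node_type": "Slave",
            "shared_ip": "192.0.2.1", "master_ip": "192.0.2.2", "slave_ip": "192.0.2.3"}
  print(check_cluster_config(node_a, node_b, other_cluster_ids=(2, 3)))  # [] means no problems found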

Adjusting HA advanced settings:

For information about some of the common HA advanced settings that may need modifications in larger environments please check out the FAQ about adjusting advanced HA settings: viewtopic.php?f=18&t=5744
