- Clavister Security Gateway 10.x and up.
When using a Clavister High Availability cluster, there may be situations where the cluster does not behave correctly. The two most common are:
1. Both Master and Slave nodes are active at the same time.
2. Both Master and Slave nodes are inactive at the same time.
Note: Another possible symptom is that the cluster nodes change roles very frequently.
This How-To will discuss both scenarios and possible ways to troubleshoot and solve them.
Note: This How-To assumes that the Active/Active configuration scenario described in the admin guide is NOT used.
- Scenario-1: Both cluster nodes active at the same time.
For the Master to find the Slave unit and vice versa, the cluster sends out heartbeats on all PHYSICAL interfaces. Heartbeats are basically a method for each cluster node to declare its peer as up or down. We can think of them as a kind of ping packet, constantly being sent and received, that tells the cluster peer that the sender is up.
Important: Heartbeats are NOT sent out on non-physical interfaces such as VLANs, IPsec tunnels, GRE tunnels, PPTP connections, Loopback interfaces etc. If VLANs are used on a switch, untagged VLAN0 must be configured on the switch to allow heartbeat packets to be sent between the Firewalls.
Each interface sends heartbeats at a set interval, and each cluster node then decides whether its cluster peer should be declared as up or down based on this data.
At the same time, this means we have some redundancy. In our example we have 5 interfaces in total. What if two of them are not used and do not even have link? Would the cluster go active/active in this scenario? No, it would not. The cluster peers still send and receive heartbeats on the other interfaces, which is enough to declare the cluster peer as alive.
Question: What if all interfaces except the sync interface are working? Would the cluster go active/active then?
Answer: No, it would not. The synchronization interface sends out heartbeats at a fixed rate that is normally more than twice that of a normal interface, but the heartbeats on the remaining interfaces would still be more than enough to declare the cluster peer(s) as alive. There would be no state synchronization though, nor any configuration synchronization unless InControl is used.
Question: What about the reverse? If all interfaces except sync are down, would that cause an active/active situation?
Answer: No, it would not. As long as even one interface can send/receive heartbeats, that is enough for a cluster node to see its cluster peer and to know which node should be active and which inactive. Having only one interface able to send/receive heartbeats, however, makes the cluster very sensitive, and the slightest network "hiccup" may cause the cluster to fail over and/or go into an active/active state. It is recommended to have heartbeats enabled and working on as many interfaces as possible; the more the better.
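The "one working interface is enough" rule can be illustrated with a minimal Python sketch. This is not how cOS Core is implemented; the function name, the data layout and the 1 second timeout are assumptions made purely for illustration.

```python
# Assumption: the peer counts as alive if any interface has heard a
# heartbeat within this many seconds. The real threshold is not stated
# in this How-To.
PEER_TIMEOUT_S = 1.0


def peer_is_alive(last_heard, now, timeout=PEER_TIMEOUT_S):
    """Return True if at least one interface heard the peer recently.

    last_heard maps an interface name to the timestamp of the last
    heartbeat received on it, or None if nothing was ever received.
    """
    return any(
        ts is not None and now - ts <= timeout
        for ts in last_heard.values()
    )


now = 100.0
# Only the sync interface still hears the peer: it is still seen as alive.
print(peer_is_alive({"G1": None, "G2": 95.0, "sync": 99.8}, now))
# No interface has heard anything recently: the peer is declared down.
print(peer_is_alive({"G1": None, "G2": 95.0, "sync": 98.0}, now))
```

The sketch also shows why a single working interface is fragile: one missed interval on that interface is enough to flip the result.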
Question: If the Sync interface is down, how can the cluster determine which node should be the active one?
Answer: The heartbeat packets sent out on all interfaces also contain the number of connections the sender node has. Each cluster node can then determine which one should be the active node by comparing its own connection count with that of its peer. The node with the most connections will be the active node.
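As a rough illustration of the connection-count comparison, here is a small Python sketch. The function name is hypothetical, and the tie-break (the Master wins when the counts are equal) is an assumption; this How-To does not say what happens on a tie.

```python
def should_be_active(my_connections, peer_connections, i_am_master):
    """Decide whether this node should be the active one.

    The node with the most connections becomes active, as described
    above. The tie-break below is an assumption for illustration only.
    """
    if my_connections != peer_connections:
        return my_connections > peer_connections
    return i_am_master  # assumed tie-break, not documented behaviour


# The Slave holds more connections, so it should stay active.
print(should_be_active(1200, 15, i_am_master=False))
# The Master sees a peer with more connections and should stay inactive.
print(should_be_active(15, 1200, i_am_master=True))
```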
We mentioned earlier that the synchronization interface sends more heartbeats than a normal interface. Simplified, the principle looks like this sequence:
1. Heartbeat sent on G1
2. Heartbeat sent on Sync
3. Heartbeat sent on G2
4. Heartbeat sent on Sync
5. Heartbeat sent on G3
6. Heartbeat sent on Sync
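The simplified sequence above can be reproduced with a few lines of Python. The generator name is hypothetical, and the real scheduling in cOS Core is certainly more involved; the sketch only shows why the sync interface ends up carrying more heartbeats than any single normal interface.

```python
def heartbeat_order(interfaces, sync="Sync"):
    """Yield interface names in the simplified send order:
    each normal interface is followed by the sync interface."""
    for iface in interfaces:
        yield iface
        yield sync


order = list(heartbeat_order(["G1", "G2", "G3"]))
print(order)  # ['G1', 'Sync', 'G2', 'Sync', 'G3', 'Sync']
print(order.count("Sync"), order.count("G1"))  # 3 1
```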
- Heartbeat characteristics:
• The source IP is the interface address of the sending security gateway.
• The destination IP is the broadcast address on the sending interface.
• The IP TTL is always 255. If cOS Core receives a cluster heartbeat with any other TTL, it is
assumed that the packet has traversed a router and therefore cannot be trusted.
• It is a UDP packet, sent from port 999, to port 999.
• The destination MAC address is the Ethernet multicast address corresponding to the shared
hardware address and this has the form:
Where NN is a bit mask made up of the interface bus, slot and port on the master and MM
represents the cluster ID.
Link layer multicasts are used over normal unicast packets for security. Using unicast packets
would mean that a local attacker could fool switches to route heartbeats somewhere else so
the inactive system never receives them.
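When looking at a packet capture, the characteristics above can be turned into a simple filter. The following Python sketch is only an illustration: the Packet class and the classify function are hypothetical helpers, not part of any Clavister tool, and the MAC address check is left out because it depends on the interface and cluster ID.

```python
from dataclasses import dataclass

HEARTBEAT_PORT = 999   # UDP source and destination port, per the list above
HEARTBEAT_TTL = 255    # any other TTL means the packet crossed a router


@dataclass
class Packet:
    protocol: str
    src_port: int
    dst_port: int
    ttl: int


def classify(pkt):
    """Classify a captured packet against the heartbeat characteristics."""
    is_heartbeat_port = (
        pkt.protocol == "udp"
        and pkt.src_port == HEARTBEAT_PORT
        and pkt.dst_port == HEARTBEAT_PORT
    )
    if not is_heartbeat_port:
        return "not-heartbeat"
    if pkt.ttl != HEARTBEAT_TTL:
        return "untrusted-heartbeat"
    return "heartbeat"


print(classify(Packet("udp", 999, 999, 255)))  # heartbeat
print(classify(Packet("udp", 999, 999, 254)))  # untrusted-heartbeat
print(classify(Packet("udp", 53, 999, 255)))   # not-heartbeat
```

A heartbeat that arrives with a TTL of 254 has passed one routing hop, which is exactly the situation described in the TTL bullet.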
The main problem when encountering an active/active situation is, as discussed above, the lack of heartbeats. But there may be situations where this is caused by how the network is designed. For instance, all interfaces may pass through equipment that actively scans the traffic. Since heartbeats are specifically designed NOT to pass through any "hop" such as a router, they will most likely be dropped by such equipment because it finds the packets suspicious.
An active/active situation means that the HA cluster nodes are not receiving enough heartbeats from their cluster peer. If both nodes go active at the same time, not even one interface is able to both send and receive heartbeats.
There are four items that can be used/investigated to try to limit the chance of active/active situations.
- Solution-1: Disable Heartbeats
The option to disable cluster heartbeats can be found under the advanced tab on each physical interface. It is recommended to note this in the comment field of each interface you disable heartbeats on, for future reference. It is also recommended to re-enable the sending of heartbeats if/when the interface is put into use.
This is an optional setting, since even one interface is enough to send/receive heartbeats. It is primarily a way to stop the cluster from sending heartbeats on interfaces that are not actively in use, so that the cluster does not generate packets on those particular interfaces/networks.
- Solution-2: Disable interfaces that are not in use.
The second solution is to disable the interface itself. Simply right-click the interface and select "disable interface". This will stop the sending of heartbeats as cOS Core will basically be unaware that the interface exists in the first place and will not take it into consideration regarding heartbeats.
WARNING! Before you disable any interface you must make sure it is NOT the registration interface you are disabling. If you disable the interface to which the license is bound, the Firewall would enter Lockdown mode! The easiest way to check which interface the license is bound to is to first use the CLI command "license" and then combine that with the CLI command "ifstat -maclist" to see a list of all interfaces and their MAC addresses.
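The comparison between the license MAC address and the interface list can be done by eye, but a small Python sketch shows the idea. It assumes the values have been copied by hand from the output of "license" and "ifstat -maclist"; it does not parse the CLI output, and the interface names and MAC addresses below are made up.

```python
def normalize_mac(mac):
    """Strip separators and case so differently formatted MACs compare equal."""
    return mac.replace("-", "").replace(":", "").lower()


def safe_to_disable(iface, license_mac, mac_by_iface):
    """Return False if iface carries the MAC address the license is bound to."""
    return normalize_mac(mac_by_iface[iface]) != normalize_mac(license_mac)


# Hypothetical values, typed in manually from the two CLI commands.
license_mac = "00-90-0B-AA-BB-01"
mac_by_iface = {
    "G1": "00:90:0b:aa:bb:01",
    "G2": "00:90:0b:aa:bb:02",
}

print(safe_to_disable("G1", license_mac, mac_by_iface))  # False
print(safe_to_disable("G2", license_mac, mac_by_iface))  # True
```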
- Solution-3: Identical hardware settings on Virtual Security Gateways
When using a Virtual Security Gateway (VSG) it is very important that both cluster nodes are created using the same virtual hardware settings. If the hardware settings are not the same, there is a very high chance of an active/active scenario. The logs would most likely also be filled with events about "disallowed_on_sync_iface".
The easiest way to compare the hardware settings between the cluster nodes is to download a Technical Support File (TSF) from e.g. the WebUI from both cluster nodes and then compare their hardware sections.
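Once the hardware sections have been copied out of the two TSF files, a plain text diff is usually enough to spot a mismatch. The sketch below uses Python's standard difflib module; the sample text is invented for illustration and does not reflect the real TSF layout.

```python
import difflib


def hardware_diff(master_section, slave_section):
    """Return unified-diff lines between two hardware sections (empty if equal)."""
    return list(
        difflib.unified_diff(
            master_section.splitlines(),
            slave_section.splitlines(),
            fromfile="master",
            tofile="slave",
            lineterm="",
            n=0,  # show only the differing lines, no context
        )
    )


# Hypothetical excerpts; real TSF hardware sections look different.
master = "nic: vmxnet3\ncpus: 2\nmemory: 2048"
slave = "nic: vmxnet3\ncpus: 4\nmemory: 2048"

for line in hardware_diff(master, slave):
    print(line)
```

An empty result means the two pasted sections are textually identical; any "-" or "+" line points at a setting worth checking on the virtual machines.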
- Solution-4: Investigate high Diffie-Hellman (DH) group usage on IPsec tunnels
The effect of using too high DH groups is that the inactive node believes its peer is gone and goes active. To troubleshoot, examine the logs prior to an active/active event to see if there was an IPsec tunnel negotiation just before the event, or perform a configuration review and check whether DH group 14 or above is used on an IPsec tunnel. A tunnel with many local and remote networks will also generate many DH key negotiations, and having many DH negotiations going on at the same time would make the situation even worse.
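For the configuration review, the check boils down to listing every tunnel that uses DH group 14 or above. A minimal Python sketch follows, assuming the tunnel names and their DH groups have been collected by hand into a dictionary; the names and the function are hypothetical.

```python
HIGH_DH_GROUP = 14  # threshold taken from the text above


def tunnels_with_high_dh(tunnels):
    """Return the names of tunnels using any DH group >= HIGH_DH_GROUP.

    tunnels maps a tunnel name to the list of DH groups configured on it.
    """
    return sorted(
        name
        for name, groups in tunnels.items()
        if any(group >= HIGH_DH_GROUP for group in groups)
    )


# Hypothetical tunnel inventory.
tunnels = {
    "branch-office": [2, 5],
    "datacenter": [14],
    "partner": [5, 19],
}

print(tunnels_with_high_dh(tunnels))  # ['datacenter', 'partner']
```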
- Final note:
It is important to highlight again that the main cause of an active/active situation is the lack of heartbeats. As long as heartbeats are unable to traverse from the Master to the Slave and vice versa, this is a potential source of problems. This should be investigated first, since even one interface should be enough for each cluster node to see its cluster peer.
- Scenario-2: Both cluster nodes inactive at the same time.
The solution to this problem is not as obvious, since very strange network problems can cause such an issue, but here are a few tips on what to check:
1. If more than one Clavister cluster is configured in the network and they can see each other, verify that they are NOT using the same cluster ID. If the same cluster ID is used, there is a chance that a cluster will see heartbeats from the other cluster's nodes as its own.
1.1. A common log entry during this kind of problem is "heartbeat_from_myself".
2. Check the network in case there is some sort of port mirroring or other equipment that mirrors the packets sent by the cluster, so that the cluster sees its own heartbeats.
3. Verify that the cluster is correctly configured, for instance so that both are not configured as Master or both as Slave node type.
4. Make sure that the IP addresses on the interfaces are not the same for Shared, Master_IP and Slave_IP. If, for instance, all 3 are using the same IP address, you would encounter very strange problems and the logs may contain "heartbeat_from_myself" entries as well.
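Checks 1 and 4 above lend themselves to a quick scripted sanity check. The Python sketch below assumes the cluster IDs and the per-interface addresses have been gathered manually; the function names, dictionary keys and addresses are all hypothetical.

```python
import ipaddress
from collections import Counter


def interfaces_with_duplicate_ha_addresses(addresses):
    """Return interfaces where Shared, Master_IP and Slave_IP are not all distinct.

    addresses maps an interface name to a dict such as
    {"shared": ..., "master_ip": ..., "slave_ip": ...}.
    """
    bad = []
    for iface, roles in addresses.items():
        parsed = [ipaddress.ip_address(ip) for ip in roles.values()]
        if len(set(parsed)) < len(parsed):
            bad.append(iface)
    return sorted(bad)


def shared_cluster_ids(cluster_ids):
    """Return cluster IDs used by more than one cluster."""
    counts = Counter(cluster_ids.values())
    return sorted(cid for cid, n in counts.items() if n > 1)


addresses = {
    "lan": {"shared": "192.168.1.1", "master_ip": "192.168.1.2", "slave_ip": "192.168.1.3"},
    "dmz": {"shared": "10.0.0.1", "master_ip": "10.0.0.1", "slave_ip": "10.0.0.3"},
}
print(interfaces_with_duplicate_ha_addresses(addresses))  # ['dmz']
print(shared_cluster_ids({"site-a": 1, "site-b": 1, "site-c": 2}))  # [1]
```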
Adjusting HA advanced settings:
For information about some of the common HA advanced settings that may need modifications in larger environments please check out the FAQ about adjusting advanced HA settings: viewtopic.php?f=18&t=5744