Is it possible to avoid interruptions on HA clusters during configuration deployment? (multiple questions)


Post by Peter » 21 Sep 2020, 11:13

This FAQ applies to:
  • cOS Core version 12 and up (but can in many cases be applied to older versions as well).
Question:
Application XYZ in my network is interrupted whenever we make configuration changes to the Firewall. How can I avoid this?

Answer:
Short answer:
It is unfortunately not possible to make a configuration change without it affecting the network in some way. The Firewall needs to load the new configuration, apply changes to connection states, synchronize with its cluster peer and much more. Zero interruptions cannot be achieved no matter how much you tweak the system.

Some functions/features, such as ALGs, L2TP sessions and more, are also not state synchronized between the cluster nodes. This means that these functions will be interrupted whenever the cluster nodes change role for one reason or another. For a complete list of items that are not state synchronized, please see the “Known Limitations” section of the cOS Core release notes.

Long answer:
Interruptions during a configuration deployment or cluster role change are a known behavior that has existed since Clavister introduced HA clusters in cOS Core many years ago. The biggest problem is not always the lack of state synchronization. Even traffic that is state synchronized can be interrupted, depending on the status of the synchronization at the moment of the role change.

Consider a Firewall cluster that handles 500 000 connections and a lot of traffic. Connection and state data is constantly being synchronized between the cluster nodes, and when a configuration is deployed the cluster will (depending on settings) at some point have to make a role change. Even if cOS Core tries to state synchronize everything, there will be a moment during the failover where the previously active cluster node goes inactive and the new active node starts to process the traffic. Suppose that during that small window, data for a new IPsec tunnel is sent to the now inactive node because the connected switch has not yet had time to update its ARP cache and port information following the failover. The inactive node would ignore all incoming data to its shared IP, and it will (most likely) then perform a reconfiguration and stop processing data during that time.

The effect would be that data is lost during that small window (we are probably talking about less than 10-20 ms).
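To get a feel for the scale, the loss during such a window can be estimated as packet rate multiplied by window length. The following is a minimal Python sketch only. The packet rate of 200 000 packets per second is an assumed figure chosen for illustration, not a measurement from any Firewall, and the function name is hypothetical.

```python
def packets_at_risk(packets_per_second: float, window_ms: float) -> float:
    """Estimate how many packets arrive during a failover window.

    Worst case assumption: every packet arriving in the window is sent
    to the node that just went inactive and is therefore dropped.
    """
    return packets_per_second * (window_ms / 1000.0)


# Assumed load for illustration; the 10-20 ms window comes from the text above.
ASSUMED_PPS = 200_000

for window_ms in (10, 20):
    lost = packets_at_risk(ASSUMED_PPS, window_ms)
    print(f"{window_ms} ms window at {ASSUMED_PPS} pps: up to {lost:.0f} packets")
```

Even a window of a few milliseconds can therefore cover a noticeable number of packets on a busy cluster, which is why timing-sensitive applications may react.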

This is only an example of a problem that could be extremely difficult to solve as it is heavily dependent on timing and connected equipment behavior.

The more data, connections, tunnels etc. that are processed by the cluster, the more data needs to be synchronized and the more data risks getting lost whenever a configuration change is made.

Important: Some data will deliberately be lost during a configuration deployment. Depending on what the administrator changes in the configuration, the disturbance can be big or small. One example is a change to an IPsec tunnel: that tunnel and all tunnels below it need to be torn down and re-established in order to set up the new security flows, since something that was allowed before the change may not necessarily be allowed after it. The tunnel(s) and any related connections will then be torn down.

Another example is changing an IP policy to deny traffic instead of allowing it. The connection table then needs to be refreshed and every already established connection needs to be re-evaluated. The new policy is also applied to existing connections, so connections carrying traffic that was previously allowed but is now denied will be torn down where applicable.
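The re-evaluation of established connections can be pictured with a small Python sketch. This is only an illustration of the idea and not how cOS Core implements it. The `Connection` type, the port-based `is_allowed` check and the addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    src: str
    dst: str
    dst_port: int


def is_allowed(conn: Connection, denied_ports: set[int]) -> bool:
    """Hypothetical policy check: traffic is denied if its port is listed."""
    return conn.dst_port not in denied_ports


def reevaluate(table: list[Connection],
               denied_ports: set[int]) -> tuple[list[Connection], list[Connection]]:
    """Split the connection table into connections kept and torn down."""
    kept = [c for c in table if is_allowed(c, denied_ports)]
    torn_down = [c for c in table if not is_allowed(c, denied_ports)]
    return kept, torn_down


table = [
    Connection("10.0.0.5", "192.0.2.10", 443),
    Connection("10.0.0.6", "192.0.2.20", 23),
]

# The administrator changes the policy for port 23 from allow to deny.
kept, torn_down = reevaluate(table, denied_ports={23})
print("kept:", kept)
print("torn down:", torn_down)
```

Every entry in the table has to be checked, so the work grows with the number of established connections.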

All of this takes time and resources from the system. The more tunnels, connections, traffic etc. that are being processed by the Firewall, the longer this process will take and the higher the chance of disturbances in the network. It can of course also be mitigated with faster hardware (CPU, memory and boot media), which is one of the many parameters involved when we determine the license parameters/restrictions for a particular hardware.

Question:
But standalone units do not lose sessions that aren’t state synchronized?

Answer:
That is true, but it also has a drawback: the entire network would be temporarily down while the Firewall performs the configuration deployment operation (sometimes referred to as “reconfigure”).

So instead of only affecting some non-synchronized functions of the Firewall, anything going through the Firewall would potentially be affected. Programs and operating systems that are sensitive to network “hiccups” (such as Citrix) can react by interrupting their sessions and reporting network problems.

To give a rough comparison between cluster and standalone Firewalls:

Standalone:
Pros:
  • Sessions that are not state synchronized, such as ALGs, L2TP/IPsec, SSL-VPN and more, have a good chance of “surviving” the configuration deployment (assuming the configuration change was not in an area that would cause said sessions to be deliberately removed/torn down).
Cons:
  • Total network outage while the Firewall performs the reconfiguration operation to load the new configuration. Packet loss may occur (depending on the size of the configuration, the number of packets being processed and the performance/capacity of the hardware).

Cluster:
Pros:
  • State synchronization of the majority of the functions of the Firewall. For state synchronized data there would be (almost) zero data loss during a configuration deployment (assuming the configuration change was not in an area that would cause said sessions to be deliberately removed/torn down). Connected clients will most likely not notice if a role change or configuration deployment was done at all.
Cons:
  • Not everything is state synchronized. Any non-synchronized session risks going down when the cluster changes role, unless the same cluster node becomes the active node again when the configuration deployment procedure is completed (discussed further down).

Question:
Is it possible to make the cluster behave like a standalone unit?

Answer:
Yes, there is a setting under “System->High Availability->Advanced” called “Reconf Failover Time”. The value is the maximum time, in seconds, that the cluster takes for a configuration deployment (plus a little margin). So let’s say the cluster takes 5 seconds to perform a configuration deployment, including an IDP and Anti-Virus database compilation. The value to set would then be around 8-10 seconds.

The exact time a Firewall took for its last reconfiguration can be displayed using the CLI command “reconfigure -timings”.
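The sizing rule above (slowest observed reconfiguration plus a margin) can be written down as a small Python helper. This is a sketch under assumptions: the function name and the default 3 second margin are hypothetical, chosen only so that the 5 second example lands in the 8-10 second range, and the value is assumed to be entered in whole seconds.

```python
import math


def suggest_reconf_failover_time(slowest_reconf_s: float,
                                 margin_s: float = 3.0) -> int:
    """Return a candidate "Reconf Failover Time" in whole seconds.

    slowest_reconf_s: the longest reconfiguration time observed, ideally
                      including an IDP/Anti-Virus database compilation.
    margin_s:         assumed safety margin (not a Clavister recommendation).
    """
    if slowest_reconf_s < 0 or margin_s < 0:
        raise ValueError("times must not be negative")
    return math.ceil(slowest_reconf_s + margin_s)


# Example from the text: a 5 second deployment gives a value of 8-10 seconds.
print(suggest_reconf_failover_time(5.0))                # lower end of the range
print(suggest_reconf_failover_time(5.0, margin_s=5.0))  # upper end of the range
```

The input should be the slowest deployment seen over time, since a database update can make an individual deployment take noticeably longer than usual.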

What this setting does (very roughly explained, as it is more complicated than this in the background) is the following. When the active node receives a configuration and is about to deploy it, this normally causes the inactive node to take over. With this setting in use, the inactive node instead “waits” for at most the configured number of seconds before it takes over. If the active node comes back within that time, it remains the active node after the process is complete.

The drawback with using this setting is that when the active node is doing the reconfiguration operation, all network traffic will be down. So there would be a general network hiccup instead of possible interruptions for specific services. The cluster will in effect behave like a standalone Firewall.
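The trade-off can be summarized in a deliberately simplified Python model. It only compares the duration of the reconfiguration with the configured wait time. The real cluster logic is more involved, and the function name and result strings are hypothetical.

```python
def outcome_of_active_node_deploy(reconf_duration_s: float,
                                  reconf_failover_time_s: float) -> tuple[str, str]:
    """Return (node active afterwards, expected effect).

    Models a deployment to the active node while the inactive peer waits
    for at most reconf_failover_time_s seconds before taking over.
    """
    if reconf_duration_s <= reconf_failover_time_s:
        # The peer waited long enough: no role change, but no node
        # forwards traffic while the active node is reconfiguring.
        return ("same node", "general network hiccup, like a standalone Firewall")
    # The active node did not come back in time, so the peer takes over.
    return ("peer node", "role change, non-synchronized sessions at risk")


print(outcome_of_active_node_deploy(5.0, 10.0))  # wait time covers the reconfiguration
print(outcome_of_active_node_deploy(5.0, 0.0))   # no waiting: the peer takes over
```

In this model, a wait time that is set too low gives the drawbacks of both approaches: traffic is down while the peer waits, and a role change still follows.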

The administrator needs to weigh the advantages and disadvantages of using this option in their network.

Question:
Why does my cluster sometimes take longer to deploy? Normally it takes 1-2 seconds, but other times it takes 5+ seconds.

Answer:
The most likely reason is that an update was detected and downloaded for the IDP or Anti-Virus databases. Compiling these databases takes extra time.

The exact time taken can be observed using the “reconfigure -timings” CLI command.

This problem can be mitigated by configuring “Status->Maintenance->Update Center->Update Interval” so that the system only performs update checks at the configured interval. In other words, do not allow cOS Core to perform the update check at every reconfigure/configuration deployment.

Question:
Does it matter if I deploy to the Active or Inactive cluster node first? Inactive seems to be recommended, why?

Answer:
Yes, it matters which node we deploy the configuration to. To explain this it would be easier to make a sequence of events showing the difference.

Let’s say the Master is active and Slave inactive in both scenarios before the start of the configuration change.

Scenario-1 – Deploying to inactive node first (default)
1. Inactive Slave node receives the configuration.
2. Inactive Slave node initiates a reconfiguration operation to load the new configuration.
3. Active Master node detects Slave node as dead, but since it is the active node it does not do anything as it is the one that is processing the traffic.
4. Inactive Slave node completes its reconfiguration with the new configuration.
5. Active Master node detects the Slave node as alive and links up with it.
6. Active Master node receives the configuration.
7. Master node initiates a reconfiguration operation to load the new configuration. If the setting “Deactivate Before Reconf” is enabled the Master node goes inactive.
8. Inactive Slave node detects that the Master node has vanished and will go active. The Slave node is now the active node.
9. Inactive Master node comes back from its deployment operation and will link up with the now active Slave node.
10. Deployment complete. Master is now inactive and Slave active.

For all state tracked connections, the users behind the Firewalls would most likely not notice anything at all. For non-state tracked connections such as L2TP, the clients need to reconnect in order to continue working. A complete list of non-state tracked connections/data can be found in the release notes for any version.

Scenario-2 – Deploying to active node first
1. Active Master node receives the configuration.
2. Master node initiates a reconfiguration operation to load the new configuration. If the setting “Deactivate Before Reconf” is enabled the Master goes inactive.
3. Inactive Slave node detects that the Master node has vanished and will go active. The Slave node is now the active node.
4. Inactive Master node comes back from its deployment operation and will link up with the now active Slave node.
5. Active Slave node receives the configuration.
6. Active Slave node initiates a reconfiguration operation to load the new configuration. If the setting “Deactivate Before Reconf” is enabled the Slave node goes inactive.
7. Inactive Master node detects that the Slave node has vanished and will go active. The Master node is now the active node.
8. Inactive Slave node comes back from its deployment operation and will link up with the now active Master node.
9. Deployment complete. Master is now active and Slave inactive.

For all state tracked connections, the users behind the Firewalls may notice small hiccups in the network as the cluster has changed role twice. For non-state tracked connections such as L2TP, the clients have a higher chance of “surviving” the procedure, as the node that was active before the configuration deployment will still be the active node when the operation is complete. This is no guarantee, however, as it depends on how long the operation took and the sensitivity of the connected clients.
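The difference between the two scenarios comes down to how many role changes happen and which node ends up active. The Python sketch below replays both deployment orders. It is a simplified model that assumes every reconfiguration of the active node triggers a failover (that is, “Reconf Failover Time” is not used), and the `simulate` function is hypothetical.

```python
def simulate(deploy_order: list[str], active: str = "Master") -> tuple[str, int]:
    """Replay a deployment and return (final active node, role changes)."""
    role_changes = 0
    for node in deploy_order:
        if node == active:
            # Reconfiguring the active node makes its peer take over.
            active = "Slave" if node == "Master" else "Master"
            role_changes += 1
        # Reconfiguring the inactive node causes no role change.
    return active, role_changes


# Master is active and Slave inactive before the deployment starts.
scenario_1 = simulate(["Slave", "Master"])  # inactive node first
scenario_2 = simulate(["Master", "Slave"])  # active node first

print("Scenario-1:", scenario_1)
print("Scenario-2:", scenario_2)
```

Scenario-1 ends with a single role change and the Slave active, while Scenario-2 ends with two role changes and the Master active again, matching the step lists above.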

Both methods have their advantages and disadvantages. Which one to use and which setting(s) to change is up to the administrator, but Clavister recommends deploying to the inactive node first as that is deemed less disruptive for connected clients.
