Adjusting advanced cluster settings on larger installations

Post by Peter » 23 Mar 2016, 10:40

This FAQ applies to:
  • cOS Core version 10 and up.
Question:

My High Availability cluster is not synchronizing properly. I have also seen incidents where the cluster changes role for no apparent reason.

Answer:

Synchronization problems and unexpected cluster role changes can have many causes, such as a hardware problem on the sync interface, a bad cable or an incorrect configuration. Beyond those, some of the Advanced Settings for High Availability may need to be adjusted. For most installations these settings never need to be changed, but for larger installations it is recommended to modify them to accommodate larger amounts of synchronization data and (depending on the scenario) lessen the chance that the cluster performs a failover due to a lack of heartbeats from its peer.

The settings that we want to adjust are the following and can be found under System->High Availability->Advanced:
  • Sync Buffer Size, default value 1024
  • Recommended value: 2048
This setting controls how much synchronization data (in KB) can be buffered before waiting for acknowledgement from its cluster peer. Today's appliance models (E80 and above) have plenty of spare memory, so allocating 2 MB instead of 1 MB should be no problem, and a little extra buffer for the synchronization does no harm.

Update 2017-10-18: The new default value for the sync buffer is now set to be 4096 for all newly generated configurations (version 12 and up).
  • Sync Packet Max Burst, default value 20
  • Recommended value: 60-100 (depending on the size of the installation in question)
This setting controls how many packets the active cluster peer can send to the inactive node in a single synchronization state burst. For larger installations (100+ users) it is highly recommended to increase this value. With the default value the active node may be unable to synchronize data fast enough, which means the inactive node may not be fully synchronized with the active node.

Update 2017-10-18: The new default value for the packet burst is now set to be 100 for all newly generated configurations (version 12 and up).
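To get a feel for why the burst size matters, here is a rough back-of-the-envelope sketch in Python. It is not how cOS Core works internally; it simply assumes that a backlog of sync packets is sent in bursts of at most "Sync Packet Max Burst" packets each. The backlog of 1000 packets and the function name bursts_needed are made up for illustration.

```python
import math


def bursts_needed(pending_packets: int, max_burst: int) -> int:
    """Number of bursts required to send all pending sync packets.

    Simplified model (an assumption, not the real cOS Core algorithm):
    every burst carries at most max_burst packets.
    """
    if max_burst <= 0:
        raise ValueError("max_burst must be positive")
    return math.ceil(pending_packets / max_burst)


# Hypothetical backlog of 1000 state-sync packets waiting on the active node.
backlog = 1000
for burst in (20, 60, 100):
    print(f"burst={burst:>3}: {bursts_needed(backlog, burst)} bursts")
```

In this model the old default of 20 needs five times as many bursts as a value of 100 to clear the same backlog, which is the kind of lag the recommendation is meant to reduce.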
  • HA Failover Time, default value 750ms
  • Recommended value: 1500ms
This setting controls how long the inactive node will "wait" before going active when it has not received sufficient heartbeats from its peer. Simply put, with the default value the inactive node will go active if it has not "seen" the active node for 750ms.

Depending on the scenario, size and network structure, 750ms can be a bit low. If the system encounters network packet bursts, the inactive node could declare the active node as inactive and go active itself. The cluster could then enter an active/active state, after which the nodes start to negotiate which of them should be the active node. This in turn could cause disruptions in the network.

One way to make the cluster less sensitive to minor network "hiccups" would be to increase this value.
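The effect of the failover time can be illustrated with a small Python simulation. This is only a sketch under simplifying assumptions: the inactive node is modelled as going active as soon as it has seen no heartbeat at all for longer than the failover time, whereas the real logic is based on "sufficient" heartbeats. The 100ms heartbeat interval, the one-second gap and the function name triggers_failover are hypothetical.

```python
def triggers_failover(heartbeat_times_ms, failover_time_ms, end_ms):
    """Return True if any gap without heartbeats exceeds failover_time_ms.

    Simplified model (an assumption): the inactive node goes active once it
    has seen no heartbeat for longer than failover_time_ms.
    """
    last_seen = 0
    for t in sorted(heartbeat_times_ms):
        if t - last_seen > failover_time_ms:
            return True
        last_seen = t
    # Also check the silence between the last heartbeat and the end of the trace.
    return end_ms - last_seen > failover_time_ms


# Hypothetical trace: a heartbeat every 100 ms, but a traffic burst causes
# every heartbeat strictly between t=2000 ms and t=3000 ms to be lost.
trace = [t for t in range(0, 5001, 100) if not (2000 < t < 3000)]

for timeout in (750, 1500):
    print(timeout, triggers_failover(trace, timeout, end_ms=5000))
```

With this trace the one-second gap triggers a role change at 750ms but is ridden out at 1500ms.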

Note: The higher the value, the longer it takes for the inactive node to take over if something happens to the active node. The value configured here has to be based on what is reasonably acceptable. Is 1.5 seconds of total network outage acceptable if something happens to the active node? That is up to the administrator to decide.
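As a summary, the three values discussed in this post can be collected in a small Python checklist. This is just an illustrative helper, not a Clavister tool: the class HaSetting, the function below_recommended and the example input are made up, and the lower end of the 60-100 range is used for the packet burst. Whether 1500ms is the right failover time for a given network is still the administrator's call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HaSetting:
    name: str
    default: int       # original default, as listed in this post
    recommended: int   # recommended value from this post (lower bound for the burst)
    unit: str


# Values taken from this post; the names are only labels for this sketch.
RECOMMENDED = [
    HaSetting("Sync Buffer Size", 1024, 2048, "KB"),
    HaSetting("Sync Packet Max Burst", 20, 60, "packets"),
    HaSetting("HA Failover Time", 750, 1500, "ms"),
]


def below_recommended(current: dict) -> list:
    """Return one note per setting whose value is below the recommendation.

    Settings missing from `current` are assumed to still be at the old default.
    """
    notes = []
    for s in RECOMMENDED:
        value = current.get(s.name, s.default)
        if value < s.recommended:
            notes.append(f"{s.name}: {value} {s.unit} (recommended: {s.recommended} {s.unit})")
    return notes


# Hypothetical cluster where only the sync buffer has been raised.
for note in below_recommended({"Sync Buffer Size": 4096}):
    print(note)
```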
