- cOS Core version 10 and up.
My High Availability cluster is not synchronizing properly. I have also seen incidents where the cluster changes role for no apparent reason.
Synchronization problems and unexpected cluster role changes can have many causes, such as a hardware problem on the sync interface, a bad cable, or an incorrect configuration. However, some of the Advanced Settings for High Availability may also need to be adjusted. For most installations these settings never need to be changed, but for larger installations it is recommended to modify them to accommodate larger amounts of synchronization data and (depending on the scenario) to lessen the chance that the cluster performs a failover due to a lack of heartbeats from its peer.
The settings that we want to adjust are the following and can be found under System->High Availability->Advanced:
- Sync Buffer Size, default value 1024
- Recommended value: 2048
Update 2017-10-18: The default value for the sync buffer is now 4096 for all newly generated configurations (version 12 and up).
- Sync Packet Max Burst, default value 20
- Recommended value: 60-100 (depending on the size of the installation in question)
Update 2017-10-18: The default value for the packet burst is now 100 for all newly generated configurations (version 12 and up).
- HA Failover Time, default value 750 ms
- Recommended value: 1500 ms
Depending on the scenario, size, and network structure, 750 ms can be a bit low. If the system encounters bursts of network packets, the inactive node could declare the active node inactive and then go active itself. The cluster could then enter an active/active state, after which the nodes start to negotiate which of them should be the active node. This in turn could cause disruptions in the network.
One way to make the cluster less sensitive to minor network hiccups is to increase this value.
Note: The higher this value, the longer it takes for the inactive node to take over if something happens to the active node. The configured value has to be based on what is reasonably acceptable: is 1.5 seconds of total network outage acceptable if something happens to the active node? It is up to the administrator to decide.
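The trade-off can be illustrated with a small Python sketch. This is a simplified model and not how cOS Core actually evaluates heartbeats: it assumes the inactive node takes over as soon as the silence from its peer reaches the configured HA Failover Time. The function name and the gap values are hypothetical and chosen only for illustration.

```python
def would_fail_over(heartbeat_gaps_ms, failover_time_ms):
    """Return True if any silence between heartbeats reaches the failover time.

    Simplified assumption: a single gap at or above the threshold is enough
    for the inactive node to declare its peer inactive and go active.
    """
    return any(gap >= failover_time_ms for gap in heartbeat_gaps_ms)


# Hypothetical gaps (in ms) between heartbeats seen by the inactive node,
# including one 900 ms stall caused by a network packet burst.
gaps = [100, 120, 900, 110]

for failover_time in (750, 1500):
    print(failover_time, would_fail_over(gaps, failover_time))
```

Under these assumptions the 900 ms stall triggers a role change at 750 ms but is ignored at 1500 ms. The cost is the one described in the note: with the higher value, a genuine failure of the active node goes unnoticed for up to 1.5 seconds.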