When clustering AEM instances, the instances must be shut down and started in a specific order. This can be tricky in a shared-nothing cluster, because the master and slave can automatically swap roles at any time for a number of reasons. To shut down an instance you therefore typically need to identify the current master first. Conversely, at startup the slave can only be started after the master has become available. If they are started out of order, the slave's process will keep running but will never connect to the master, leaving a dead process that must be stopped and restarted. This document describes changes to the start and stop scripts that ensure the instances are stopped in an acceptable order and prevent the slave node from starting without a master. The approach has two requirements: an administrative password must be stored in the stop script, and it is designed only for a cluster of two instances.

Changes to the start script

The changes to the start script rely on some mechanics within AEM and on external curl commands. First, the script checks for the presence of crx-quickstart/repository/clustered.txt; if the file exists, this instance needs to be started as a slave. In that case the script sends a curl request to the other instance and keeps the HTTP response code. If the code is 200, the master instance is up and running and the slave should be safe to start. If not, the script waits a short delay and tries the curl command again, repeating indefinitely until a 200 status is returned.

To configure this, open the start script located in crx-quickstart/bin/ and find the comment line that marks where not to edit below. Insert the following code snippet in the lines above that marker.

#  This section was added to ensure the instance does not start before the master
if [ -e "$(dirname "$0")/../repository/clustered.txt" ]; then
    echo "This instance was shut down as a slave. Halting startup until the master is available."
    # Initialise STATUS so the first comparison does not fail on an empty value
    STATUS=0
    while [ "$STATUS" != 200 ]; do
         # -s hides the progress meter; only the HTTP status code is captured
         STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://aemhost:aemport/libs/granite/core/content/login.html 2>/dev/null)
         sleep 4
    done
fi

Update aemhost and aemport to reflect the host and port of the other AEM instance, not the instance to which this start script belongs.
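The polling loop described above retries forever, which can leave a start script hanging if the other node never comes up. As a variation, here is a minimal sketch of the same check wrapped in a function with an optional retry limit. The function name wait_for_master and its delay and max-attempts parameters are hypothetical and not part of AEM or of the original script.

```shell
#!/bin/sh
# Hypothetical helper (not part of AEM): poll a URL until it answers HTTP 200.
# Arguments: URL, delay in seconds (default 4),
#            maximum attempts (default 0, meaning retry forever as in the original loop).
wait_for_master() {
    url=$1
    delay=${2:-4}
    max_tries=${3:-0}
    tries=0
    while :; do
        # -s hides the progress meter; a failed connection is reported as code 000
        status=$(curl -s -o /dev/null -w '%{http_code}' "$url" 2>/dev/null)
        [ "$status" = 200 ] && return 0
        tries=$((tries + 1))
        # Give up once the attempt limit is reached (only when a limit was set)
        if [ "$max_tries" -gt 0 ] && [ "$tries" -ge "$max_tries" ]; then
            return 1
        fi
        sleep "$delay"
    done
}
```

In the start script, the call could look like: wait_for_master "http://aemhost:aemport/libs/granite/core/content/login.html" 4 0. Passing a non-zero third argument lets the script exit with an error instead of waiting indefinitely.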

Changes to the stop script

The changes to the stop script are a little less intelligent than those to the start script. The script does not care whether its own instance is currently the slave or the master. It blindly sends a curl command to the other cluster node instructing it to become the cluster master, then pauses for a short delay to allow the changeover to occur. If the other node is already the master, the call is unnecessary and has no effect. Otherwise the roles are swapped, which allows the instance being shut down to go down cleanly and in the proper order, without breaking the cluster.

To configure this, edit the stop script located in crx-quickstart/bin/ and locate the line containing the text "java -jar $CQ_JARFILE $START_OPTS". Insert the following code above this line:

if [ -e repository/clustered.txt ]; then
        echo "This is a clustered instance; ensuring this is not the master by ordering the other instance to assume the role."
        curl -u "${LOGIN}:${PASSWORD}" -X POST "http://${MASTER_SERVER}/system/console/jmx/com.adobe.granite%3Atype%3DRepository/op/becomeClusterMaster/"
        sleep 10
fi

  • Be sure to set the server IP/hostname and port in the MASTER_SERVER variable, and update the credentials in the LOGIN and PASSWORD variables.
  • MASTER_SERVER should reference the other clustered instance, not the one this script belongs to.
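Because the stop script sends the handover request blindly, a wrong password or an unreachable node would go unnoticed. The following is a minimal sketch of a wrapper that captures the HTTP status of the same request and reports a failure. The function name request_master_handover is hypothetical, and treating any 2xx response as "request accepted" is an assumption about how the JMX console answers, not documented behaviour.

```shell
#!/bin/sh
# Hypothetical helper (not part of AEM): ask the other node to become master
# and report whether the request appears to have been accepted.
# Assumes LOGIN, PASSWORD and MASTER_SERVER are already set in the stop script.
request_master_handover() {
    url="http://${MASTER_SERVER}/system/console/jmx/com.adobe.granite%3Atype%3DRepository/op/becomeClusterMaster/"
    # Capture only the HTTP status code; the response body is discarded
    status=$(curl -s -o /dev/null -w '%{http_code}' -u "${LOGIN}:${PASSWORD}" -X POST "$url" 2>/dev/null)
    case "$status" in
        2??) return 0 ;;   # assumption: any 2xx code means the call was accepted
        *)   echo "Handover request failed (HTTP $status)" >&2
             return 1 ;;
    esac
}
```

In the stop script this could replace the bare curl line, for example: request_master_handover || echo "Warning: continuing shutdown without a confirmed handover". Whether to abort or continue the shutdown on failure is a choice left to the administrator.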