Highly Available croit Management Containers
Afraid that the croit management server is a single point of failure in your otherwise highly available storage? Don't be!
A Word of Caution: The Recommended Simple Setup
The setup described below is overkill for most deployments. Our recommended setup is to use our cloud backup feature, which allows you to re-create the container on any system if you ever lose it. You can also run the croit management server in a VM and either keep a backup of this VM or make the VM itself highly available.
Note that downtime of the croit management VM will not impact the operation of your Ceph cluster; croit only stores the server configuration and is only required during server boot.
This doesn't quite fit the definition of high availability because restoring the VM incurs a short downtime of the management server, but Ceph itself and all croit-configured services on Ceph stay available.
If you need the croit management service itself to be highly available then the following guide is for you.
Step 1: Use a Highly Available MariaDB Cluster
croit keeps all its important internal data in a MariaDB database, so we need to make that highly available. Deploy a MariaDB cluster to get started; refer to the MariaDB Galera Cluster documentation.
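A minimal sketch of a three-node Galera setup (host names, the cluster name, and the provider path are assumptions; check the Galera documentation for your distribution):
# /etc/mysql/conf.d/galera.cnf on every node; the provider path may differ per distribution
cat > /etc/mysql/conf.d/galera.cnf <<'EOF'
[galera]
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = croit-db
wsrep_cluster_address    = gcomm://db1,db2,db3
binlog_format            = row
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
EOF
# bootstrap the cluster on the first node only
galera_new_cluster
# start the remaining nodes normally
systemctl start mariadb
# verify that all nodes joined
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"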
Importing an Existing DB
Use mysqldump to export the croit database from the MariaDB instance running inside the container. You can dump it directly to your management host from outside the container:
docker exec -it croit mysqldump croit > croit.sql
Note: Alternatively, use docker exec -it croit bash to enter the container; mysqldump is already installed there.
Import this dump into your MariaDB cluster.
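For example, with placeholders for your cluster's host and credentials:
# import the dump into the external cluster
mysql -h <db-host> -u <user> -p croit < croit.sql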
Fresh Setup
croit expects the database croit to already exist, so create it:
create database croit;
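croit also needs a user it can connect with; a sketch run against your cluster, with the user name and host pattern as assumptions:
mysql -e "create user 'croit'@'%' identified by '<password>';"
mysql -e "grant all privileges on croit.* to 'croit'@'%';"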
Connect croit to an External Database
Use docker exec -it croit bash to enter the container and edit the database section in the file /config/config.yml:
database:
  user: <user>
  password: <password>
  url: jdbc:mariadb:sequential://<host1>,<host2>/croit
  driverClass: org.mariadb.jdbc.Driver
Afterwards, restart the container by executing docker restart croit and verify operations.
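To verify, check the logs for database connection errors and look at the container's Docker health status (the same jq filter is used by the resource agent in step 4):
docker logs --tail 50 croit
# should print "healthy" once the container is up
docker inspect croit | jq -r '.[0].State.Health.Status'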
Step 2: Statistics
croit uses Graphite for statistics and stores them internally by default, so we need to store them somewhere else to make them highly available as well.
We recommend go-graphite for statistics, but the default Python server is also supported.
Adjust the following settings in /config/config.yml:
# server from where the dashboard fetches statistics
graphiteServer: http://<user>:<password>@<host>:<port>
# prefix used for all metrics, can be adjusted
graphiteMetricPrefix: croit.
# (optional) use a proxy server
#graphiteProxy: <proxy>
# server to which the servers send their metrics, usually the same server as above
graphiteTarget: <host>
graphiteTargetPort: <port, usually 2003>
Afterwards, restart the container by executing docker restart croit and verify operations.
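To verify, you can push a test datapoint over Graphite's plaintext protocol and read it back through the render API (host, port, and metric name are placeholders):
# send one datapoint to the carbon listener
echo "croit.test 1 $(date +%s)" | nc -q0 <graphite-host> 2003
# query it back from the web frontend
curl "http://<graphite-host>:<port>/render?target=croit.test&format=json"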
Note that having Graphite highly available is optional; croit will continue to work when statistics are not available.
Step 3: Log Files (Optional)
All servers send their log files to the management server in systemd format by default. This means you'd lose access to them after a failover occurs, so send them to another syslog server instead.
Add the setting syslogTarget: <server> to /config/config.yml to configure the log server.
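The receiving side only needs a syslog daemon that accepts remote messages; a sketch for rsyslog (port and protocol are assumptions, match them to your syslogTarget value):
# enable UDP reception on the remote syslog server
cat > /etc/rsyslog.d/10-remote.conf <<'EOF'
module(load="imudp")
input(type="imudp" port="514")
EOF
systemctl restart rsyslog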
Note: Since croit only stores the logs for a very short time, we do not believe there is any significant advantage in setting this up.
Step 4: Deploy an Additional croit Container and Configure a Virtual IP
The last step is to deploy additional croit containers and configure failover between them. Copy your /config/config.yml into each container and keep the copies in sync whenever you change them.
Only one croit container can be active at the same time because we can only support a single IP address for the DHCP/PXE boot process. So we need a virtual IP address that follows the currently running croit container.
A possible way to achieve this is by using Pacemaker/Corosync, refer to the Pacemaker manual for details on configuring it.
You can use the following custom OCF resource agent to handle the croit container and the virtual IP address. Do not set the restart=always option in Docker for containers managed by Pacemaker; otherwise Pacemaker gets confused and fails.
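If your container was already created with restart=always, the policy can be changed in place instead of recreating the container:
docker update --restart=no croit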
#!/bin/bash

: ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
. ${OCF_FUNCTIONS}
: ${__OCF_ACTION=$1}

croit_meta_data() {
    cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="croit" version="1.0">
  <version>1.0</version>
  <longdesc lang="en">
    This resource agent manages a croit docker container and its associated PXE boot IP.
  </longdesc>
  <shortdesc lang="en">croit resource agent</shortdesc>
  <parameters>
    <parameter name="ip" unique="1" required="1">
      <shortdesc lang="en">IP address used for PXE in CIDR notation</shortdesc>
      <content type="string" default="" />
    </parameter>
    <parameter name="nic" unique="1" required="1">
      <shortdesc lang="en">interface on which to bind the IP</shortdesc>
      <content type="string" default="" />
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="20s" />
    <action name="stop" timeout="20s" />
    <action name="monitor" timeout="20s" interval="10s" depth="0"/>
    <action name="meta-data" timeout="5s" />
  </actions>
</resource-agent>
END
}

croit_usage() {
    echo "usage: $0 {start|stop|monitor|meta-data}"
}

croit_start() {
    ip link set up dev "$OCF_RESKEY_nic"
    ip address add "$OCF_RESKEY_ip" dev "$OCF_RESKEY_nic"
    if ! ip address show dev "$OCF_RESKEY_nic" | grep -qF "$OCF_RESKEY_ip" ; then
        ocf_log err "failed to add IP address"
        return $OCF_ERR_GENERIC
    fi
    # announce the moved IP to the network; arping expects the address without the CIDR prefix
    addr=${OCF_RESKEY_ip%/*}
    arping -c 1 -I "$OCF_RESKEY_nic" -S "$addr" -A "$addr" -q
    arping -c 1 -I "$OCF_RESKEY_nic" -S "$addr" -U "$addr" -q
    if ! docker start croit ; then
        ocf_log err "failed to start croit container"
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}

croit_stop() {
    ip address delete "$OCF_RESKEY_ip" dev "$OCF_RESKEY_nic"
    docker stop croit
    return $OCF_SUCCESS
}

croit_monitor() {
    status=$(docker inspect croit | jq -r '.[0].State.Status')
    # jq prints "null" if the container does not exist
    if [[ -z $status || $status == "null" ]] ; then
        ocf_log err "could not find croit container status -- is the container deployed?"
        return $OCF_ERR_GENERIC
    fi
    if [[ $status == "exited" ]] ; then
        return $OCF_NOT_RUNNING
    fi
    if ! ip address show dev "$OCF_RESKEY_nic" | grep -qF "$OCF_RESKEY_ip" ; then
        ocf_log err "ip address is not bound but container not in state exited"
        return $OCF_ERR_GENERIC
    fi
    health=$(docker inspect croit | jq -r '.[0].State.Health.Status')
    if [[ $health == "unhealthy" ]] ; then
        ocf_log err "croit container is unhealthy, check container"
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}

case $__OCF_ACTION in
    meta-data)  croit_meta_data
                exit $OCF_SUCCESS
                ;;
    start)      croit_start;;
    stop)       croit_stop;;
    monitor)    croit_monitor;;
    usage|help) croit_usage
                exit $OCF_SUCCESS
                ;;
    *)          croit_usage
                exit $OCF_ERR_UNIMPLEMENTED
                ;;
esac
rc=$?
ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION : $rc"
exit $rc
This script requires the jq and arping utilities to be installed. On Debian systems, you can install them by executing apt install jq arping.
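To register the agent with Pacemaker, install it into an OCF provider directory on every cluster node and create the resource. A sketch using pcs; the provider directory "custom", the file name, the IP, and the interface are assumptions:
# install the agent (repeat on every node)
install -D -m 0755 croit-ra /usr/lib/ocf/resource.d/custom/croit

# create the resource with its PXE IP and a monitoring interval
pcs resource create croit-mgmt ocf:custom:croit \
    ip=192.0.2.10/24 nic=eth0 op monitor interval=10s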
Note on Backups and External Databases
Backups internally contain only two things: a dump of the entire MariaDB database and everything under /config in the container. Our automatic cloud backup can only handle the embedded local database, so make sure to keep a backup of both your config and the distributed MariaDB database for disaster recovery.
Check out our scripts for manual backups as an example.
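A minimal manual backup along those lines (cluster host and credentials are placeholders):
# dump the croit database from the external cluster
mysqldump -h <db-host> -u <user> -p croit > croit-$(date +%F).sql
# archive the container's /config directory
docker exec croit tar -czf - /config > croit-config-$(date +%F).tar.gz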