Highly Available croit Management Containers
Afraid that the croit management server is a single point of failure in your otherwise highly available storage? Don't be!
A Word of Caution: The Recommended Simple Setup
The setup described below is overkill for most deployments. Our recommended setup is to use our cloud backup feature, which allows you to re-create the container on any system if you ever lose it. You can also run the croit management server in a VM and either keep a backup of this VM or make the VM itself highly available.
Note that downtime of the croit management VM will not impact the operation of your Ceph cluster; croit only stores the server configuration and is only required during server boot.
This doesn't quite fit the definition of high availability because restoring the VM incurs a short downtime of the management server, but Ceph itself and all croit-configured services on Ceph stay available.
If you need the croit management service itself to be highly available then the following guide is for you.
Step 1: Use a Highly Available MariaDB Cluster
croit keeps all its important internal data in a MariaDB database, so we need to make that highly available. Deploy a MariaDB cluster to get started; refer to the MariaDB Galera Cluster documentation.
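A minimal sketch of a three-node Galera setup (host names, the cluster name, and the provider path are assumptions; check the Galera documentation for your distribution):
# /etc/mysql/conf.d/galera.cnf on every node; the provider path may differ per distribution
cat > /etc/mysql/conf.d/galera.cnf <<'EOF'
[galera]
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = croit-db
wsrep_cluster_address    = gcomm://db1,db2,db3
binlog_format            = row
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
EOF
# bootstrap the cluster on the first node only
galera_new_cluster
# start the remaining nodes normally
systemctl start mariadb
# verify that all nodes joined
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"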
Importing an Existing DB
Use mysqldump to export the croit database from the MariaDB instance running inside the container. You can dump it directly to your management host from outside the container:
docker exec -it croit mysqldump croit > croit.sql
Note: Alternatively, use docker exec -it croit bash to enter the container; mysqldump is already installed there.
Import this dump into your MariaDB cluster.
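For example, with placeholders for your cluster's host and credentials:
# import the dump into the external cluster
mysql -h <db-host> -u <user> -p croit < croit.sql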
Fresh Setup
croit expects the database croit to already exist, so create it:
create database croit;
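croit also needs a user it can connect with; a sketch run against your cluster, with the user name and host pattern as assumptions:
mysql -e "create user 'croit'@'%' identified by '<password>';"
mysql -e "grant all privileges on croit.* to 'croit'@'%';"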
Connect croit to an External Database
Use docker exec -it croit bash to enter the container and edit the database section in the file /config/config.yml:
database:
  user: <user>
  password: <password>
  url: jdbc:mariadb:sequential://<host1>,<host2>/croit
  driverClass: org.mariadb.jdbc.Driver
Afterwards, restart the container by executing docker restart croit and verify operations.
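To verify, check the logs for database connection errors and look at the container's Docker health status (the same jq filter is used by the resource agent in step 4):
docker logs --tail 50 croit
# should print "healthy" once the container is up
docker inspect croit | jq -r '.[0].State.Health.Status'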
Step 2: Statistics
croit uses Graphite for statistics and stores them internally by default, so we need to store them somewhere else to make them highly available as well.
We recommend go-graphite for statistics, but the default Python server is also supported.
Adjust the following settings in /config/config.yml:
# server from where the dashboard fetches statistics
graphiteServer: http://<user>:<password>@<host>:<port>
# prefix used for all metrics, can be adjusted
graphiteMetricPrefix: croit.
# (optional) use a proxy server
#graphiteProxy: <proxy>
# server to which the servers send their metrics, usually the same server as above
graphiteTarget: <host>
graphiteTargetPort: <port, usually 2003>
Afterwards, restart the container by executing docker restart croit and verify operations.
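To verify, you can push a test datapoint over Graphite's plaintext protocol and read it back through the render API (host, port, and metric name are placeholders):
# send one datapoint to the carbon listener
echo "croit.test 1 $(date +%s)" | nc -q0 <graphite-host> 2003
# query it back from the web frontend
curl "http://<graphite-host>:<port>/render?target=croit.test&format=json"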
Note that having Graphite highly available is optional; croit will continue to work when statistics are not available.
Step 3: Log Files (Optional)
All servers send their log files to the management server in systemd format by default. This means you'd lose access to them after a failover occurs, so send them to another syslog server instead.
Add the setting syslogTarget: <server> to /config/config.yml to configure the log server.
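The receiving side only needs a syslog daemon that accepts remote messages; a sketch for rsyslog (port and protocol are assumptions, match them to your syslogTarget value):
# enable UDP reception on the remote syslog server
cat > /etc/rsyslog.d/10-remote.conf <<'EOF'
module(load="imudp")
input(type="imudp" port="514")
EOF
systemctl restart rsyslog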
Note: Since croit only stores the logs for a very short time, we do not believe there is any significant advantage in setting this up.
Step 4: Deploy an Additional croit Container and Configure a Virtual IP
The last step is to deploy additional croit containers and configure failover between them. Copy your /config/config.yml into each container and keep the copies in sync whenever you change them.
Only one croit container can be active at the same time because we can only support a single IP address for the DHCP/PXE boot process. So we need a virtual IP address that follows the currently running croit container.
A possible way to achieve this is by using Pacemaker/Corosync, refer to the Pacemaker manual for details on configuring it.
You can use the following custom OCF resource agent to handle the croit container and the virtual IP address. Do not set the restart=always option in Docker for containers managed by Pacemaker; otherwise Pacemaker gets confused and fails.
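If your container was already created with restart=always, the policy can be changed in place instead of recreating the container:
docker update --restart=no croit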
#!/bin/bash

: ${OCF_FUNCTIONS=${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs}
. ${OCF_FUNCTIONS}
: ${__OCF_ACTION=$1}

croit_meta_data() {
    cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="croit" version="1.0">
  <version>1.0</version>
  <longdesc lang="en">
    This resource agent manages a croit docker container and its associated PXE boot IP.
  </longdesc>
  <shortdesc lang="en">croit resource agent</shortdesc>
  <parameters>
    <parameter name="ip" unique="1" required="1">
      <shortdesc lang="en">IP address used for PXE in CIDR notation</shortdesc>
      <content type="string" default="" />
    </parameter>
    <parameter name="nic" unique="1" required="1">
      <shortdesc lang="en">interface on which to bind the IP</shortdesc>
      <content type="string" default="" />
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="20s" />
    <action name="stop" timeout="20s" />
    <action name="monitor" timeout="20s" interval="10s" depth="0"/>
    <action name="meta-data" timeout="5s" />
  </actions>
</resource-agent>
END
}

croit_usage() {
    echo "usage: $0 {start|stop|monitor|meta-data}"
}

croit_start() {
    ip link set up dev "$OCF_RESKEY_nic"
    ip address add "$OCF_RESKEY_ip" dev "$OCF_RESKEY_nic"
    if ! ip address show dev "$OCF_RESKEY_nic" | grep -qF "$OCF_RESKEY_ip" ; then
        ocf_log err "failed to add IP address"
        return $OCF_ERR_GENERIC
    fi
    # announce the moved IP to the network; arping expects the address without the CIDR prefix
    addr=${OCF_RESKEY_ip%/*}
    arping -c 1 -I "$OCF_RESKEY_nic" -S "$addr" -A "$addr" -q
    arping -c 1 -I "$OCF_RESKEY_nic" -S "$addr" -U "$addr" -q
    if ! docker start croit ; then
        ocf_log err "failed to start croit container"
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}

croit_stop() {
    ip address delete "$OCF_RESKEY_ip" dev "$OCF_RESKEY_nic"
    docker stop croit
    return $OCF_SUCCESS
}

croit_monitor() {
    status=$(docker inspect croit | jq -r '.[0].State.Status')
    # jq prints "null" if the container does not exist
    if [[ -z $status || $status == "null" ]] ; then
        ocf_log err "could not find croit container status -- is the container deployed?"
        return $OCF_ERR_GENERIC
    fi
    if [[ $status == "exited" ]] ; then
        return $OCF_NOT_RUNNING
    fi
    if ! ip address show dev "$OCF_RESKEY_nic" | grep -qF "$OCF_RESKEY_ip" ; then
        ocf_log err "ip address is not bound but container not in state exited"
        return $OCF_ERR_GENERIC
    fi
    health=$(docker inspect croit | jq -r '.[0].State.Health.Status')
    if [[ $health == "unhealthy" ]] ; then
        ocf_log err "croit container is unhealthy, check container"
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}

case $__OCF_ACTION in
    meta-data)  croit_meta_data
                exit $OCF_SUCCESS
                ;;
    start)      croit_start;;
    stop)       croit_stop;;
    monitor)    croit_monitor;;
    usage|help) croit_usage
                exit $OCF_SUCCESS
                ;;
    *)          croit_usage
                exit $OCF_ERR_UNIMPLEMENTED
                ;;
esac
rc=$?
ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION : $rc"
exit $rc
This script requires the jq and arping utilities to be installed. On Debian systems, you can install them by executing apt install jq arping.
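To register the agent with Pacemaker, install it into an OCF provider directory on every cluster node and create the resource. A sketch using pcs; the provider directory "custom", the file name, the IP, and the interface are assumptions:
# install the agent (repeat on every node)
install -D -m 0755 croit-ra /usr/lib/ocf/resource.d/custom/croit

# create the resource with its PXE IP and a monitoring interval
pcs resource create croit-mgmt ocf:custom:croit \
    ip=192.0.2.10/24 nic=eth0 op monitor interval=10s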
Note on Backups and External Databases
Backups internally contain only two things: a dump of the entire MariaDB database and everything under /config in the container. Our automatic cloud backup can only handle the embedded local database, so make sure to keep a backup of both your config and the distributed MariaDB database for disaster recovery.
Check out our scripts for manual backups as an example.
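A minimal manual backup along those lines (cluster host and credentials are placeholders):
# dump the croit database from the external cluster
mysqldump -h <db-host> -u <user> -p croit > croit-$(date +%F).sql
# archive the container's /config directory
docker exec croit tar -czf - /config > croit-config-$(date +%F).tar.gz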