Ansible: Monitoring and Baselining a Ceph Cluster

Written by Michael Sevilla

In this post we check the health of our Ceph deployment by benchmarking RADOS write speed, network bandwidth, and disk bandwidth. We assume that you have already set up a cluster using the Ansible: Running Ceph on a Cluster blog post. We tied these two posts together because the monitoring tools are provided by our ceph-popper-template.

Monitoring Ceph

Ceph provides the Calamari monitoring framework, but when we started our work with Ceph it was neither open source nor production quality, so we built our own framework using Docker containers. The images package tools such as collectl, which collects per-node system metrics, and Graphite, which stores and graphs them.

All you have to do is add a host for Graphite to the inventory:

diff --git a/site/hosts b/site/hosts
index 664cf92..0f00a25 100644
--- a/site/hosts
+++ b/site/hosts
@@ -9,3 +9,6 @@
 
 [clients]
 <ADD CLIENTs>
+
+[graphite]
+piha.soe.ucsc.edu ansible_ssh_user=issdm
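After updating the inventory, it is worth confirming that Ansible can actually reach the new host. A minimal check, assuming the inventory path site/hosts from the diff above:

```shell
# Ping the new [graphite] group over SSH using Ansible's ping module.
ansible graphite -i site/hosts -m ping
```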

If you set up the cluster using our Ansible/Docker: Running Ceph on a Cluster blog post, then these tools were started in containers when you ran deploy.sh. Logging in to one of the Ceph OSD nodes, we see:

:~$ docker ps
IMAGE                                       COMMAND             NAMES
michaelsevilla/collectl                     "/entrypoint.sh"    collectl
piha.soe.ucsc.edu:5000/ceph/daemon:master   "/entrypoint.sh"    issdm-0-osd-devsde

We can add any type of monitoring service to any node using site/ceph_monitor.yml:

---

- hosts: graphite
  become: True
  roles: 
    - monitor/graphite

- hosts: osds
  become: True
  roles:
    - monitor/collectl
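To apply the playbook above, run it against the same inventory (a sketch; the inventory path is assumed from the diff earlier in the post):

```shell
# Deploy Graphite to the [graphite] host and collectl to every OSD node.
ansible-playbook -i site/hosts site/ceph_monitor.yml
```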

These tools are configured the same way the Ceph daemons are: with the variables in the site/group_vars directory. As an example, look at the Graphite configuration:

~/experiment$ cat site/group_vars/graphite
---
webui_port: 8082

These variables can be changed if you need to use different ports on your cluster. To understand how these variables are used we need to look at the deploy code for the service. In this case we look at the Ansible deploy code for Graphite here.

We see that WEBUI_PORT is passed into the Docker container as an environment variable. To understand what the container does with this variable, we look at the Graphite image source code. Taking a peek, we see that it is launched with the entrypoint for the Docker image saved in the srl-roles repository. That entrypoint shows that WEBUI_PORT is passed as a command line argument:

./bin/run-graphite-devel-server.py --port $WEBUI_PORT /opt/graphite
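For illustration, this is the usual shell pattern for consuming such an environment variable with a fallback to the default from site/group_vars/graphite (a hedged sketch, not the actual entrypoint code):

```shell
# Fall back to 8082 if WEBUI_PORT was not passed in via `docker run -e`.
WEBUI_PORT=${WEBUI_PORT:-8082}
echo "launching Graphite web UI on port ${WEBUI_PORT}"
```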

Welcome to the world of Ansible/Docker – things are linked all over the place and figuring out what calls what takes us on a journey that traverses multiple repositories and content hosting services. :)

Running the Baseliner

To ensure that the cluster is healthy, run the Ceph baseliner.sh tool, which is packaged with our experiment template repository. It runs sanity checks inspired by the Ceph blog post and benchmarks RADOS write throughput, network bandwidth, and disk I/O.

~/experiment$ ./baseliner.sh
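The individual checks behind a baseliner like this typically boil down to a few standard commands. A hedged sketch, with the pool name, server host, and file path below as placeholders:

```shell
# RADOS write throughput: a 10-second write benchmark against a test pool.
rados bench -p testpool 10 write

# Network bandwidth: start iperf on one node, point a client at it from another.
iperf -s                     # on the server node
iperf -c server.example.com  # on the client node

# Disk write bandwidth: direct I/O to a file on the OSD's data disk.
dd if=/dev/zero of=/mnt/osd0/ddtest bs=1M count=1024 oflag=direct
```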

While it is running, we can explore the different metrics using Graphite – we have a video showing off some of the powerful visualization features:

Graphite

When the benchmarks finish the results are stored in the results directory.

Visualizing Results

Reading raw result files is hard, especially when the formats differ. We package the scripts for parsing and graphing results in a Python notebook. To interact with the results and graphs, run:

~/experiment$ docker run -v `pwd`/results:/home/jovyan/work -p 81:8888 jupyter/scipy-notebook
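Depending on the notebook image version, the server may require a login token, which is printed in the container's logs. A sketch (the container ID is whatever `docker ps` reports for the notebook container):

```shell
# Find the notebook container, then print its startup log, which includes
# the tokenized URL; substitute port 81 for 8888, as mapped above.
docker ps --filter ancestor=jupyter/scipy-notebook
docker logs <container-id>
```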

The parsing code and graphs live in visualize.ipynb. The notebook for our results is rendered by GitHub here (but note that GitHub does not make the notebook interactive). A screenshot of the notebook is below:

[Screenshot of the visualize.ipynb notebook]

Awesome. Now we can monitor and measure our cluster.


