CephFS Striping Strategy

Written by Michael Sevilla

This is a collection of experiments that explore the CephFS striping strategy as a means to scale up the storage on individual database nodes (DB nodes).

Setup

First, download, compile, and start Ceph. If you are starting on a clean machine, we suggest using our Docker images, which have all the dependencies needed to build Ceph:

msevilla@pl3:~$ git clone --recursive https://github.com/ceph/ceph.git
msevilla@pl3:~$ docker run --rm -it --privileged --volume `pwd`/ceph:/ceph --entrypoint /bin/bash  cephbuilder/ceph:latest
root@0bb75695c80b:/ceph# mkdir build
root@0bb75695c80b:/ceph# cd build
root@0bb75695c80b:/ceph/build# cmake ..
root@d844d9fc42f9:/ceph/build# make -j24

The change in the command-line prompt indicates that we have dropped into the container. A description of what the Docker commands are doing is beyond the scope of this tutorial; if you want more info, check out the docker-cephdev wiki.

Next, start a virtual cluster:

root@0bb75695c80b:/ceph/build# OSD=10 MON=1 MDS=1 ../src/vstart.sh -n -l --short
[... wait a couple minutes ...]
root@d844d9fc42f9:/ceph/build# bin/ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
    cluster 527fe5e0-c2ab-4fd2-ba6c-cfc6ad7970eb
     health HEALTH_WARN
            3 near full osd(s)
     monmap e1: 1 mons at {a=127.0.0.1:6789/0}
            election epoch 3, quorum 0 a
      fsmap e5: 1/1/1 up {0=a=up:active}
        mgr no daemons active 
     osdmap e17: 3 osds: 3 up, 3 in
            flags nearfull,sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v40: 24 pgs, 3 pools, 2148 bytes data, 20 objects
            1028 GB used, 156 GB / 1185 GB avail
                  24 active+clean

I am not sure how to get rid of the near full OSD warning.
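
One possible workaround, which I have not tried here and which only exists on newer Ceph releases, is to raise the nearfull threshold so the warning clears. The usage numbers mostly reflect the build machine's own disk, since vstart.sh keeps all the OSD data under the build directory:

bin/ceph osd set-nearfull-ratio 0.95    # untested here; newer releases only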

Finally, we can mount CephFS:

root@d844d9fc42f9:/ceph/build# mkdir /cephfs
root@d844d9fc42f9:/ceph/build# bin/ceph-fuse /cephfs
2016-10-27 23:51:34.545317 7f06968faec0 -1 init, newargv = 0x555a4e248ae0 newargc=11
ceph-fuse[18613]: starting ceph client
ceph-fuse[18613]: starting fuse
root@d844d9fc42f9:/ceph/build# mount
[... snip ...]
ceph-fuse on /cephfs type fuse.ceph-fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
root@d844d9fc42f9:/ceph/build# touch /cephfs/test
root@d844d9fc42f9:/ceph/build# ls /cephfs/
test

How does CephFS work?

By default, files are striped across 4MB objects in the data pool, while the file system namespace is managed by the metadata server (MDS) cluster.

root@d844d9fc42f9:/ceph/build# touch /cephfs/empty
root@d844d9fc42f9:/ceph/build# echo "hi" > /cephfs/file1
root@d844d9fc42f9:/ceph/build# dd if=/dev/zero of=/cephfs/file2 bs=4k count=2048
root@d844d9fc42f9:/ceph/build# ls -alh /cephfs
total 8.1M
drwxr-xr-x  1 root root    0 Oct 28 00:09 .
drwxr-xr-x 49 root root 4.0K Oct 27 23:51 ..
-rw-r--r--  1 root root    0 Oct 28 00:03 empty
-rw-r--r--  1 root root    3 Oct 28 00:04 file1
-rw-r--r--  1 root root 8.0M Oct 28 00:09 file2
root@d844d9fc42f9:/ceph/build# bin/rados lspools
rbd
cephfs_data_a
cephfs_metadata_a
root@d844d9fc42f9:/ceph/build# bin/rados ls -p cephfs_data_a 
10000000005.00000001
10000000005.00000000
10000000004.00000000

Empty files have no corresponding objects, but file1 and file2 are striped over 4MB objects. The object names have the format <fileID>.<partitionID>, where <fileID> is the file's inode number in hexadecimal and the <partitionID>s are sequential numbers, one per object. file2 is larger than 4MB, so it is striped over multiple objects; file1 is only 3 bytes, so it fits in a single object.
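
As a quick cross-check (these commands were not part of the original session), we can look up a file's inode number, print it in hex, and grep for it in the data pool:

# look up file2's inode, convert it to hex, and list its RADOS objects
ino=$(stat -c %i /cephfs/file2)
printf '%x\n' "$ino"
bin/rados -p cephfs_data_a ls | grep "^$(printf '%x' "$ino")\."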

Changing the Striping Strategy

We can inspect and change a file's layout using extended attributes. By default, files are striped over 4MB objects:

root@d844d9fc42f9:/ceph/build# touch /cephfs/file3
root@d844d9fc42f9:/ceph/build# getfattr -n ceph.file.layout.object_size /cephfs/file3
# file: cephfs/file3
ceph.file.layout.object_size="4194304"
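
The object size is only one field of the layout. The full layout (stripe unit, stripe count, object size, and data pool) can be read in one shot with the ceph.file.layout attribute; this was not part of the original session, and the exact output format varies by release:

getfattr -n ceph.file.layout /cephfs/file3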

We can change the stripe unit and object size with:

root@d844d9fc42f9:/ceph/build# setfattr -n ceph.file.layout.stripe_unit -v 1048576 /cephfs/file3
root@d844d9fc42f9:/ceph/build# setfattr -n ceph.file.layout.object_size -v 1048576 /cephfs/file3
root@d844d9fc42f9:/ceph/build# bin/rados -p cephfs_data_a ls
root@d844d9fc42f9:/ceph/build# dd if=/dev/zero of=/cephfs/file3 bs=4k count=2048
2048+0 records in
2048+0 records out
8388608 bytes (8.4 MB) copied, 0.497699 s, 16.9 MB/s
root@d844d9fc42f9:/ceph/build# bin/rados -p cephfs_data_a ls
10000000007.00000001
10000000007.00000006
10000000007.00000004
10000000007.00000002
10000000007.00000000
10000000007.00000005
10000000007.00000003
10000000007.00000007
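
The 8MB file is now striped over eight 1MB objects, which matches the file size divided by the new object size:

echo $((8388608 / 1048576))    # 8MB / 1MB = 8 objects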

Unfortunately, a file cannot mix object sizes: the layout can only be changed while the file is still empty. Once data has been written, setfattr fails (surfacing as a "Directory not empty" error). The next section works around this by setting the layout on a fresh, empty file and copying the data into it:

root@d844d9fc42f9:/ceph/build# dd if=/dev/zero of=/cephfs/test bs=4k count=2048
root@d844d9fc42f9:/ceph/build# setfattr -n ceph.file.layout.stripe_unit -v 1048576 /cephfs/test 
setfattr: /cephfs/test: Directory not empty

Simulating a DB Node Partition

This is how we can “scale up” a DB node: write a partition with the default layout, then re-stripe it over smaller objects by copying it into a new file with a different layout:

root@d844d9fc42f9:/ceph/build# time dd if=/dev/zero of=/cephfs/partition1 bs=1M count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 95.3735 s, 11.3 MB/s

real	1m35.380s
user	0m0.004s
sys	0m0.752s
root@d844d9fc42f9:/ceph/build# time mv /cephfs/partition1 /cephfs/tmp

real	0m0.018s
user	0m0.000s
sys	0m0.000s
root@d844d9fc42f9:/ceph/build# touch /cephfs/partition1
root@d844d9fc42f9:/ceph/build# setfattr -n ceph.file.layout.stripe_unit -v 1048576 /cephfs/partition1
root@d844d9fc42f9:/ceph/build# setfattr -n ceph.file.layout.object_size -v 1048576 /cephfs/partition1
root@d844d9fc42f9:/ceph/build# time cp /cephfs/tmp /cephfs/partition1 

real	1m39.647s
user	0m0.008s
sys	0m0.996s
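
Before counting objects, we can confirm that the new layout stuck (again, not part of the original session):

getfattr -n ceph.file.layout.object_size /cephfs/partition1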

We can verify how many objects each file is striped over:

root@d844d9fc42f9:/ceph/build# bin/rados -p cephfs_data_a ls | grep 1000000000c | wc -l
256
root@d844d9fc42f9:/ceph/build# bin/rados -p cephfs_data_a ls | grep 1000000000d | wc -l
1024

Both files have the same 1GB of content, but they are striped over a different number of objects: the original file (inode 1000000000c, now /cephfs/tmp) uses 256 4MB objects, while the re-striped copy (inode 1000000000d, /cephfs/partition1) uses 1024 1MB objects.
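
To double-check that the re-striped copy is byte-for-byte identical to the original (this check was not part of the original session):

cmp /cephfs/tmp /cephfs/partition1 && echo "files match"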


