This is a collection of experiments that explore the CephFS striping strategy as a means to scale up the storage on individual database (DB) nodes.
First, download, compile, and start Ceph. If you are starting on a clean machine, we suggest using our Docker images; the container used below has all the dependencies needed to build Ceph:
msevilla@pl3:~$ git clone --recursive https://github.com/ceph/ceph.git
msevilla@pl3:~$ docker run --rm -it --privileged --volume `pwd`/ceph:/ceph --entrypoint /bin/bash cephbuilder/ceph:latest
root@0bb75695c80b:/ceph# mkdir build
root@0bb75695c80b:/ceph# cd build
root@0bb75695c80b:/ceph/build# cmake ..
root@d844d9fc42f9:/ceph/build# make -j24
The change of the command line prompt indicates that we have dropped into a container. A description of what the Docker commands are doing is beyond the scope of this tutorial; if you want more information, check out the docker-cephdev wiki.
Next, start a virtual cluster:
root@0bb75695c80b:/ceph/build# OSD=10 MON=1 MDS=1 ../src/vstart.sh -n -l --short
[... wait a couple of minutes ...]
root@d844d9fc42f9:/ceph/build# bin/ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
    cluster 527fe5e0-c2ab-4fd2-ba6c-cfc6ad7970eb
     health HEALTH_WARN
            3 near full osd(s)
     monmap e1: 1 mons at {a=127.0.0.1:6789/0}
            election epoch 3, quorum 0 a
      fsmap e5: 1/1/1 up {0=a=up:active}
        mgr no daemons active
     osdmap e17: 3 osds: 3 up, 3 in
            flags nearfull,sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v40: 24 pgs, 3 pools, 2148 bytes data, 20 objects
            1028 GB used, 156 GB / 1185 GB avail
                  24 active+clean
I am not sure how to get rid of the near full OSD warning.
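One likely culprit (an assumption, not something verified here) is that the vstart.sh OSDs keep their data under the build directory, so the warning reflects how full the host disk is rather than anything CephFS-specific. Raising the nearfull threshold should at least silence it; on the pre-Luminous releases used here that is a pg command:
# Hedged sketch: raise the nearfull threshold above the current usage ratio
# (0.95 is an arbitrary example; the default is 0.85).
bin/ceph pg set_nearfull_ratio 0.95
# Alternatively, free up space on the disk backing /ceph/build, since that is
# where the vstart.sh OSDs store their data.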
Finally, we can mount CephFS:
root@d844d9fc42f9:/ceph/build# mkdir /cephfs
root@d844d9fc42f9:/ceph/build# bin/ceph-fuse /cephfs
2016-10-27 23:51:34.545317 7f06968faec0 -1 init, newargv = 0x555a4e248ae0 newargc=11
ceph-fuse[18613]: starting ceph client
ceph-fuse[18613]: starting fuse
root@d844d9fc42f9:/ceph/build# mount
[... snip ...]
ceph-fuse on /cephfs type fuse.ceph-fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
root@d844d9fc42f9:/ceph/build# touch /cephfs/test
root@d844d9fc42f9:/ceph/build# ls /cephfs/
test
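As an aside that was not part of the original session, the FUSE mount can be torn down when you are finished with the experiments:
# Unmount the ceph-fuse client (umount /cephfs also works as root).
fusermount -u /cephfs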
Files are striped over 4MB objects and the file system namespace is managed by the metadata server (MDS) cluster.
root@d844d9fc42f9:/ceph/build# touch /cephfs/empty
root@d844d9fc42f9:/ceph/build# echo "hi" > /cephfs/file1
root@d844d9fc42f9:/ceph/build# dd if=/dev/zero of=/cephfs/file2 bs=4k count=2048
root@d844d9fc42f9:/ceph/build# ls -alh /cephfs
total 8.1M
drwxr-xr-x 1 root root 0 Oct 28 00:09 .
drwxr-xr-x 49 root root 4.0K Oct 27 23:51 ..
-rw-r--r-- 1 root root 0 Oct 28 00:03 empty
-rw-r--r-- 1 root root 3 Oct 28 00:04 file1
-rw-r--r-- 1 root root 8.0M Oct 28 00:09 file2
root@d844d9fc42f9:/ceph/build# bin/rados lspools
rbd
cephfs_data_a
cephfs_metadata_a
root@d844d9fc42f9:/ceph/build# bin/rados ls -p cephfs_data_a
10000000005.00000001
10000000005.00000000
10000000004.00000000
Empty files have no corresponding objects, but file1 and file2 are striped over 4MB objects. The format of the object names is <fileID>.<partitionID>, where the <partitionID>s are sequential numbers that correspond to a file's objects. file2 is larger than 4MB, so it is striped over multiple objects.
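As a sanity check (a sketch, not something run in the original session), the <fileID> prefix should match the file's inode number printed in hex, so we can map a file to its RADOS objects by hand:
# List the RADOS objects backing /cephfs/file2; the object name prefix is the
# file's inode number in hexadecimal.
ino=$(printf '%x' "$(stat -c '%i' /cephfs/file2)")
bin/rados -p cephfs_data_a ls | grep "^${ino}\."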
We can change a file's layout using extended attributes. Files default to an object size of 4MB:
root@d844d9fc42f9:/ceph/build# touch /cephfs/file3
root@d844d9fc42f9:/ceph/build# getfattr -n ceph.file.layout.object_size /cephfs/file3
# file: cephfs/file3
ceph.file.layout.object_size="4194304"
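Only one field is queried above; as an aside not captured in the original run, the aggregate ceph.file.layout attribute dumps the whole layout (stripe_unit, stripe_count, object_size, and pool) at once:
# Print every layout field for the file in one shot.
getfattr -n ceph.file.layout /cephfs/file3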
We can change the object size with:
root@d844d9fc42f9:/ceph/build# setfattr -n ceph.file.layout.stripe_unit -v 1048576 /cephfs/file3
root@d844d9fc42f9:/ceph/build# setfattr -n ceph.file.layout.object_size -v 1048576 /cephfs/file3
root@d844d9fc42f9:/ceph/build# bin/rados -p cephfs_data_a ls
root@d844d9fc42f9:/ceph/build# dd if=/dev/zero of=/cephfs/file3 bs=4k count=2048
2048+0 records in
2048+0 records out
8388608 bytes (8.4 MB) copied, 0.497699 s, 16.9 MB/s
root@d844d9fc42f9:/ceph/build# bin/rados -p cephfs_data_a ls
10000000007.00000001
10000000007.00000006
10000000007.00000004
10000000007.00000002
10000000007.00000000
10000000007.00000005
10000000007.00000003
10000000007.00000007
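Setting layouts one file at a time gets tedious. As an aside not exercised above, a layout can also be set on a directory so that files created inside it inherit it; the directory name here is just an example:
# Hedged sketch: make new files under /cephfs/small default to 1MB objects.
mkdir /cephfs/small
setfattr -n ceph.dir.layout.stripe_unit -v 1048576 /cephfs/small
setfattr -n ceph.dir.layout.object_size -v 1048576 /cephfs/small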
Unfortunately, you cannot change the layout of a file that already contains data (the "Directory not empty" message below is the MDS returning ENOTEMPTY for a non-empty file):
root@d844d9fc42f9:/ceph/build# dd if=/dev/zero of=/cephfs/test bs=4k count=2048
root@d844d9fc42f9:/ceph/build# setfattr -n ceph.file.layout.stripe_unit -v 1048576 /cephfs/test
setfattr: /cephfs/test: Directory not empty
This is how we can “scale up” a DB node: write a 1GB file with the default 4MB object size, then copy its contents into a fresh file laid out with 1MB objects:
root@d844d9fc42f9:/ceph/build# time dd if=/dev/zero of=/cephfs/partition1 bs=1M count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 95.3735 s, 11.3 MB/s
real 1m35.380s
user 0m0.004s
sys 0m0.752s
root@d844d9fc42f9:/ceph/build# time mv /cephfs/partition1 /cephfs/tmp
real 0m0.018s
user 0m0.000s
sys 0m0.000s
root@d844d9fc42f9:/ceph/build# touch /cephfs/partition1
root@d844d9fc42f9:/ceph/build# setfattr -n ceph.file.layout.stripe_unit -v 1048576 /cephfs/partition1
root@d844d9fc42f9:/ceph/build# setfattr -n ceph.file.layout.object_size -v 1048576 /cephfs/partition1
root@d844d9fc42f9:/ceph/build# time cp /cephfs/tmp /cephfs/partition1
real 1m39.647s
user 0m0.008s
sys 0m0.996s
We can verify how many objects were created for each file:
root@d844d9fc42f9:/ceph/build# bin/rados -p cephfs_data_a ls | grep 1000000000c | wc -l
256
root@d844d9fc42f9:/ceph/build# bin/rados -p cephfs_data_a ls | grep 1000000000d | wc -l
1024
Each file has the same content but is striped over a different number of objects: 1GB with the default 4MB object size maps to 256 objects, while the copy with 1MB objects maps to 1024.
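To compare the two layouts on the read path as well (not measured in the original run), one could time streaming each file back out; note that client-side caching will skew the numbers unless the mount is re-created first:
# Hedged sketch: sequential reads of the 4MB-object file and the 1MB-object file.
time dd if=/cephfs/tmp of=/dev/null bs=1M
time dd if=/cephfs/partition1 of=/dev/null bs=1M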