Replaying CephFS Journal Events

Written by Michael Sevilla

In this post we will use the CephFS journal tool to save metadata updates from one cluster and replay them on another. While this tool is usually reserved for repairing a corrupt file system, we use it here to learn a little more about the CephFS fault-tolerance model.

CephFS comes packaged with a journal tool for disaster recovery. Metadata servers stream a journal of metadata updates into RADOS and these events are later materialized in RADOS as a metadata store. The metadata store has all the information about files, including the hierarchy and organization of the file system. CephFS streams a journal of updates into RADOS for two reasons:

  1. fault tolerance: if the metadata server fails, the journal can be replayed to return the file system to a consistent state

  2. performance: the metadata server goes as fast as it can write to the journal; writes into RADOS are sequential; and events are “trimmed” if they are redundant or irrelevant (e.g., creating a file and then immediately deleting it)
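The trimming idea in point 2 can be sketched in a few lines of Python. This is a hypothetical illustration, not CephFS code: none of these names come from the Ceph source tree, and the real MDS trims at a coarser granularity (expired journal segments).

```python
def trim(events):
    """Drop create/unlink pairs that cancel each other out.

    events is a list of (operation, path) tuples in journal order.
    A file that is created and then immediately deleted has no net
    effect on the metadata store, so neither event needs replaying.
    """
    trimmed = []
    for op, path in events:
        if op == "unlink" and trimmed and trimmed[-1] == ("create", path):
            trimmed.pop()  # the matching create never needs to be applied
        else:
            trimmed.append((op, path))
    return trimmed

journal = [
    ("mkdir", "testdir"),
    ("create", "tmp.txt"),
    ("unlink", "tmp.txt"),          # cancels the create above
    ("create", "testdir/file.txt"),
]
print(trim(journal))
# [('mkdir', 'testdir'), ('create', 'testdir/file.txt')]
```

The journal shrinks from four events to two, which is exactly why trimming helps replay performance: fewer events to scan, fewer updates to materialize.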

Setup

First we create a single-node Ceph cluster. For a more detailed description of how to do this, check out our CephFS Striping Strategy blog post. Assuming we have a Ceph build container running, we start Ceph, mount CephFS, and create some metadata:

msevilla@pl3:~/ceph$ docker exec -it ceph-dev /bin/bash
root@14902c1e7961:/ceph# cd build
root@14902c1e7961:/ceph/build# ../src/vstart.sh -n -l -k
[... snip ...]
root@14902c1e7961:/ceph/build# mkdir /cephfs; bin/ceph-fuse /cephfs
2017-01-06 20:00:21.703375 7f6b5dfadec0 -1 init, newargv = 0x5601df8f8ea0 newargc=11
ceph-fuse[17401]: starting ceph client
ceph-fuse[17401]: starting fuse
root@14902c1e7961:/ceph/build# mkdir /cephfs/testdir
root@14902c1e7961:/ceph/build# touch /cephfs/file_root.txt
root@14902c1e7961:/ceph/build# touch /cephfs/testdir/file.txt  

Exporting the Journal

We can look at the journal contents with:

root@14902c1e7961:/ceph/build# bin/cephfs-journal-tool event get list
0x4004f3 SUBTREEMAP_TEST:  ()
0x40086f UPDATE:  (mkdir)
  testdir
0x401376 UPDATE:  (openc)
  file_root.txt
0x401d45 OPEN:  ()
  testdir
0x4024a9 UPDATE:  (openc)
  testdir/file.txt
0x402c18 SUBTREEMAP_TEST:  ()
0x402f94 UPDATE:  (scatter_writebehind)
0x4033d0 SUBTREEMAP_TEST:  ()
0x40374c UPDATE:  (cap update)
  file_root.txt
0x403d7e SUBTREEMAP_TEST:  ()
0x4040fa UPDATE:  (cap update)
  testdir/file.txt
0x404848 SUBTREEMAP_TEST:  ()
0x404bc4 UPDATE:  (scatter_writebehind)
0x405000 SUBTREEMAP_TEST:  ()

Here we see all of our operations, which are redundantly saved in RADOS. The UPDATE events are operations on files or directories; the capability (cap update) and scatter_writebehind events maintain cache coherence between clients to ensure exclusive access. Now we save the journal so that we can apply the events to a different Ceph cluster:

root@14902c1e7961:/ceph/build# bin/cephfs-journal-tool journal export backup.bin
journal is 4194304~19396
wrote 19396 bytes at offset 4194304 to backup.bin
NOTE: this is a _sparse_ file; you can
	$ tar cSzf backup.bin.tgz backup.bin
      to efficiently compress it while preserving sparseness.

The journal is written in a binary format, so we can only really read the header. The header is updated periodically so that metadata servers can replay events more quickly, but most operations still require a full journal scan. Next we will apply these journal events to a new cluster.
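To see why the header speeds up replay, consider this hypothetical sketch: the header records a position up to which events have already been materialized in the metadata store, so a recovering metadata server only replays events past that position. The field names below are illustrative, not CephFS's actual on-disk format.

```python
# A toy journal: (offset, event) pairs in write order.
events = [
    (0x400000, "SUBTREEMAP"),
    (0x40086f, "mkdir testdir"),
    (0x401376, "openc file_root.txt"),
    (0x4024a9, "openc testdir/file.txt"),
]

# Hypothetical header: expire_pos marks how far events have already
# been flushed into the metadata store; write_pos is the journal tail.
header = {"expire_pos": 0x401376, "write_pos": 0x4024a9}

# Replay starts at expire_pos, skipping events already on stable storage.
to_replay = [e for off, e in events if off >= header["expire_pos"]]
print(to_replay)
# ['openc file_root.txt', 'openc testdir/file.txt']
```

Advancing expire_pos as events are materialized is what lets replay avoid rescanning the whole journal after a failure.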

Merging Updates into a New Cluster

Now we restart Ceph:

root@14902c1e7961:/ceph/build# umount /cephfs
root@14902c1e7961:/ceph/build# ../src/vstart.sh -n -l -k
[... snip ...]
root@14902c1e7961:/ceph/build# mkdir /cephfs; bin/ceph-fuse /cephfs
2017-01-06 20:00:21.703375 7f6b5dfadec0 -1 init, newargv = 0x5601df8f8ea0 newargc=11
ceph-fuse[17401]: starting ceph client
ceph-fuse[17401]: starting fuse

First, we make sure that the journal and file system are empty. Because we deployed a new cluster, all the file system data and metadata should be gone.

root@14902c1e7961:/ceph/build# bin/cephfs-journal-tool event get summary
Events by type:
  SUBTREEMAP: 1
Errors: 0
root@14902c1e7961:/ceph/build# ls -alh /cephfs
total 4.5K
drwxr-xr-x  1 root root    0 Jan  6 22:08 .
drwxr-xr-x 51 root root 4.0K Jan  6 21:09 ..

Next we load the events from the binary we saved earlier:

root@14902c1e7961:/ceph/build# bin/cephfs-journal-tool journal import backup.bin
root@14902c1e7961:/ceph/build# bin/cephfs-journal-tool event get summary
Events by type:
  OPEN: 1
  SESSION: 1
  SUBTREEMAP: 1
  SUBTREEMAP_TEST: 8
  UPDATE: 6
Errors: 0

Next, we replay the events from the journal onto the metadata store. There are two ways to do this, both of which manipulate the metadata directly:

  1. apply: blindly apply all updates

  2. recover_entries: apply updates if and only if their version is newer than the metadata store version
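The difference between the two modes can be sketched in Python. This is a hypothetical model of the semantics described above, not the tool's internals; the version-per-path store is an assumption made for illustration.

```python
def replay(store, events, mode):
    """Replay journal events onto a metadata store.

    store maps path -> (version, data). In "apply" mode every event
    overwrites the store; in "recover_entries" mode an event is applied
    only if its version is newer than what the store already holds.
    """
    for path, version, data in events:
        current = store.get(path)
        if mode == "apply" or current is None or version > current[0]:
            store[path] = (version, data)
    return store

store = {"file_root.txt": (7, "newer-on-disk")}
events = [
    ("file_root.txt", 5, "from-journal"),
    ("testdir/file.txt", 3, "from-journal"),
]

print(replay(dict(store), events, "apply"))
# apply blindly overwrites: file_root.txt becomes (5, 'from-journal')
print(replay(dict(store), events, "recover_entries"))
# recover_entries keeps the newer (7, 'newer-on-disk') entry
```

In both modes the missing testdir/file.txt entry is created; they only differ when the store already has a newer version of a path.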

We use apply because we want to overwrite entries in the metadata store; passing list as the output format shows which updates are applied:

root@14902c1e7961:/ceph/build# bin/cephfs-journal-tool event apply list
0x400000 SUBTREEMAP:  ()
0x40037c SESSION:  ()
0x4004f3 SUBTREEMAP_TEST:  ()
0x40086f UPDATE:  (mkdir)
  testdir
0x400ffa SUBTREEMAP_TEST:  ()
0x401376 UPDATE:  (openc)
  file_root.txt
0x4019c9 SUBTREEMAP_TEST:  ()
0x401d45 OPEN:  ()
  testdir
0x40212d SUBTREEMAP_TEST:  ()
0x4024a9 UPDATE:  (openc)
  testdir/file.txt
0x402c18 SUBTREEMAP_TEST:  ()
0x402f94 UPDATE:  (cap update)
  file_root.txt
0x4035c6 SUBTREEMAP_TEST:  ()
0x403942 UPDATE:  (cap update)
  testdir/file.txt
0x404090 SUBTREEMAP_TEST:  ()
0x40440c UPDATE:  (scatter_writebehind)
0x404848 SUBTREEMAP_TEST:  ()

Now we clear the journal (otherwise we get a nasty segfault in the metadata server):

root@14902c1e7961:/ceph/build# bin/cephfs-journal-tool journal reset
old journal was 4194304~19396
new journal start will be 8388608 (4174908 bytes past old end)
writing journal head
writing EResetJournal entry
done

Finally, we fail the metadata server so that the next standby metadata server will replay the events from the journal stored in RADOS. vstart sometimes brings up the metadata servers in different orders, so make sure to fail the active metadata server:

root@14902c1e7961:/ceph/build# bin/ceph mds fail <active metadata server>
failed mds gid 4115

To verify, we mount CephFS and check to see if our namespace is recovered:

root@14902c1e7961:/ceph/build# bin/ceph-fuse /cephfs
root@14902c1e7961:/ceph/build# ls -alh /cephfs
total 5.0K
drwxr-xr-x  1 root root    0 Jan  6 21:18 .
drwxr-xr-x 51 root root 4.0K Jan  6 21:09 ..
-rw-r--r--  1 root root    0 Jan  6 21:18 file_root.txt
drwxr-xr-x  1 root root    0 Jan  6 21:18 testdir

We see the files we created on the old file system in our new deployment. Notice that the modification times are new: the timestamps are from when the apply occurred.

Conclusion

We showed how to save the journal of metadata updates to a file using the CephFS journal tool and how to replay it on another cluster. This is useful when something goes wrong, like a corrupt file system, but we are interested in it because our next project will look at decoupling the file system namespace, doing work in isolation, and merging updates back into the global namespace. The CephFS journal tool will be a big part of that work… stay tuned for updates.
