One of the reasons I set up my cluster was that I’m running out of space on my NAS. I don’t want to buy a whole new chassis, and while I could have put individual file shares on each cluster node, that would be inconvenient and would provide no data redundancy without a lot of brittle home-rolled hacks to rsync data from node to node. Since distributed file systems are a thing, I’d rather not resort to hacks.
Alternatives Considered
To make a long story short, I considered Ceph, GlusterFS, LizardFS & MooseFS.
Ceph
Ceph’s published specs show that it basically won’t fit on my ODROID nodes: metadata servers need 1GB and OSDs another 500MB, while the HC2s in my cluster have only 2GB of onboard RAM and are already running k8s. So Ceph is out.
GlusterFS
GlusterFS looked nice at first, but it’s both less flexible (you can’t add single drives to the storage cluster) and less performant (by a factor of 2 compared to MooseFS) on my HC2 hardware.
LizardFS
LizardFS is a fork of MooseFS, and when I tried to install it on my cluster I had a hard time finding debs for it. While searching for them online, I found a lot of complaints about performance issues on ARM, which eliminated it from consideration.
MooseFS
There were a lot of things to like about MooseFS for my use case on my hardware.
- There were prebuilt ARM debs on their site, in a handy PPA.
- It was twice as fast on my hardware as GlusterFS.
- I can add individual drive bricks to the cluster, even though my storage policies (more on them below) all require multiple replicas of data in the filesystem they’re applied to.
- The memory requirements are small enough that I can run it on the same nodes that I’m running kubernetes on.
- It dynamically balances disk usage across the bricks - when I added a third brickserver to my cluster, moosefs shuffled replica chunks over to it until all three servers had a roughly equal usage percentage.
- It allows custom storage policies:
- You can label storage bricks (say SSD as label A, spinning disks as label B) and use the labels in policy definitions.
- It’s flexible - you can create policies with different replication requirements and assign those on a per-directory or even per-file basis.
- By referring to brick labels, you can do things like create a policy requiring that, at file creation, one replica be written to SSD and one to spinning disk (so the write can be reported back to the writing process as done quickly), and that afterwards a third copy be made so the file ends up with two copies on spinning disk and one on SSD.
- You can make policies whose replication requirements change after a user-specified amount of time - so maybe new files get one copy on SSD and two on spinning disk, but after 30 days they switch to one copy on a regular spinning disk and two on bigger, slower drives. (There’s a sketch of what this looks like right after this list.)
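To make that concrete, here’s roughly what defining and applying those policies looks like with the MooseFS 3 storage class tools, once the cluster described below is installed and mounted. This is a sketch, not my actual configuration - the class names, the A (SSD) and B (spinning disk) labels, and the exact flags are assumptions, so check the mfsscadmin and mfssetsclass man pages before copying anything:
# Assumes chunkservers have been labeled A (SSD) or B (spinning disk) via LABELS in mfschunkserver.cfg.
# New chunks get one copy on SSD and one on HDD; at steady state, keep one SSD copy plus two HDD copies.
mfsscadmin create -C A,B -K A,B,B fastnew
# A tiered class: same creation/keep rules, but once a file has sat unmodified past the
# archive delay (-d; check the units in the man page), keep three copies on spinning disk only.
mfsscadmin create -C A,B -K A,B,B -A B,B,B -d 30 tiered
# Apply a class recursively to a directory inside the mounted filesystem.
mfssetsclass -r tiered /data/squirrel/photos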
Installing
You’re going to need a MooseFS master, a chunkserver on each machine that hosts drives, and ideally a metadata backup server. You may also want to run the cgi server to visualize the cluster status.
Pre-Requisites
- Static IPs for the cluster nodes, preferably with DNS entries. Setting this up is out of scope for this post.
- Decide which nodes will just be chunkservers, which will be the master, and optionally which are going to be metadata servers and cgi servers. You can run the master, metadata and cgi servers on machines that are also chunkservers.
All servers
Add the MooseFS PPA
- Add a file, /etc/apt/sources.list.d/moosefs.list, with the following contents:
deb http://ppa.moosefs.com/moosefs-3/apt/raspbian/jessie jessie main
- Run wget -O - https://ppa.moosefs.com/moosefs.key | sudo apt-key add - to add the MooseFS PPA key
- Run apt-get update
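If you’d rather paste it all in one go, the same three steps as shell commands:
echo "deb http://ppa.moosefs.com/moosefs-3/apt/raspbian/jessie jessie main" | sudo tee /etc/apt/sources.list.d/moosefs.list
wget -O - https://ppa.moosefs.com/moosefs.key | sudo apt-key add -
sudo apt-get update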
Master Server
Install the master server software on one of your nodes with apt install moosefs-master. Do this first; the chunkservers will need to communicate with it.
Optionally install the cgi server with apt install moosefs-cgiserv
Configure /etc/mfs/mfsmaster.cfg and /etc/mfs/mfsexports.cfg. Start by copying /etc/mfs/mfsmaster.cfg.sample and /etc/mfs/mfsexports.cfg.sample.
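For reference, here’s a sketch of what mfsexports.cfg entries can look like - the format (network, exported path, options) comes from the sample file, and the 192.168.1.0/24 subnet is just a placeholder for your own LAN:
# Let hosts on the local subnet mount the whole tree read-write; alldirs lets clients
# mount subdirectories, and maproot=0 leaves root unsquashed.
192.168.1.0/24    /    rw,alldirs,maproot=0
# The special "." export is the MooseFS meta filesystem (trash, reserved files).
192.168.1.0/24    .    rw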
Set the master software to start on boot with systemctl enable moosefs-master. If you installed the cgi server, enable it too with systemctl enable moosefs-cgiserv.
Start the master and cgi server with systemctl start moosefs-master && systemctl start moosefs-cgiserv.
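A quick sanity check once they’re running - look at the unit status, plus the web UI if you installed the cgi server (by default it should be listening on port 9425, but check if your install differs):
systemctl status moosefs-master moosefs-cgiserv
# then browse to http://yourmaster.example.com:9425 to see the cluster status page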
Chunkservers
For each of your chunkservers, take the following steps:
Install moosefs software
Install the software with apt install moosefs-chunkserver.
Configure which drives to use for storage
Make a directory to store the moosefs data. On my HC2 instances, I mount the data drives on /mnt/sata, and keep the raw mfs data in /mnt/sata/moosefs.
Configure /etc/mfs/mfshdd.cfg. There’s an example in /etc/mfs/mfshdd.cfg.sample - add one line per directory to be shared.
In my case, I want to keep 50 gigs free on the drive, so my entry in mfshdd.cfg is
/mnt/sata/moosefs -50GB
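One gotcha: the Debian packages run the chunkserver as the mfs user (that’s the default WORKING_USER, at least), so the data directory needs to be owned by it - something like:
sudo mkdir -p /mnt/sata/moosefs
sudo chown -R mfs:mfs /mnt/sata/moosefs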
Configure the chunkserver options
Copy /etc/mfs/mfschunkserver.cfg.sample to /etc/mfs/mfschunkserver.cfg and edit it to meet your needs - at a minimum, you’ll need to set MASTER_HOST = yourmaster.example.com
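For illustration, the relevant lines end up looking something like this - the hostname is a placeholder, and the LABELS option (which the storage policies above key off of) is optional, so check mfschunkserver.cfg.sample for the exact syntax your version expects:
# which master this chunkserver registers with
MASTER_HOST = yourmaster.example.com
# optional: label this chunkserver for storage class policies (e.g. B = spinning disk)
LABELS = B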
Enable and start the chunkserver
Set up the chunkserver to start at boot, and start it now -
systemctl enable moosefs-chunkserver && systemctl start moosefs-chunkserver
Mounting the filesystem
Now that the chunkservers are talking to the master, you can set up the filesystem to mount automatically on your nodes.
First, install the client software - sudo apt install -y moosefs-client
Second, make a mountpoint. On my nodes, I’m using /data/squirrel, so sudo mkdir -p /data/squirrel
Finally, create a systemd unit file so that the filesystem mounts every boot. I want to be able to use hostPath directives in my kubernetes deployments (there’s an example at the end of this post), so I want it to start before docker and kubelet. Make a file, /etc/systemd/system/yourcluster-mfsmount.service, with the following content (replace /mountpoint with whatever mountpoint you’re using):
# Original source: https://sourceforge.net/p/moosefs/mailman/message/29522468/
[Unit]
Description=MooseFS mounts
After=syslog.target network.target ypbind.service moosefs-chunkserver.service moosefs-master.service
Before=docker.service kubelet.service
[Service]
Type=forking
TimeoutSec=600
ExecStart=/usr/bin/mfsmount /mountpoint -H YOUR_MASTER_SERVER
ExecStop=/usr/bin/umount /mountpoint
[Install]
WantedBy=multi-user.target
Enable it so it starts every boot, and start it now:
systemctl enable yourcluster-mfsmount && systemctl start yourcluster-mfsmount
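And since the whole point (for me) is using this from kubernetes: once the mount is up on every node, pods can reach it with a plain hostPath volume. Here’s a minimal sketch - the pod name, image, and the /data/squirrel path are just examples:
apiVersion: v1
kind: Pod
metadata:
  name: mfs-test
spec:
  containers:
  - name: shell
    image: busybox
    command: ["sh", "-c", "ls /squirrel && sleep 3600"]
    volumeMounts:
    - name: squirrel
      mountPath: /squirrel
  volumes:
  - name: squirrel
    hostPath:
      path: /data/squirrel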