Five years of vSAN, Gluster, and Ceph
We used to have a headache at least once a year - ‘we have to reboot/patch the SAN’. Whether it was Isilon, Symmetrix, CLARiiON, Shark, or just plain old NFS on ZFS, it’s a pain when you have hundreds of servers running on it. We were always surprised that, even after paying half a million, you still had to interrupt services. Imagine the joy when distributed file systems came of age and that was (we thought) a thing of the past.
I’ve had three such distributed file systems in production over the last 5 years: VMware vSAN, Gluster, and Ceph.
vSAN has been the focus, but it’s somewhat of a black box. If anything goes wrong beyond a simple drive replacement, you call the vendor. It has also been surprisingly susceptible to data loss. When you have over 100 spinning drives, you expect to replace a few a year. But every time you do, there’s a chance that the recovery fails and you have to restore some VMs from backup. You’d think it would have become more proactive about this over years of upgrades.
Ceph has been running almost as long, and it’s very much the opposite. All sorts of things to look at. Enough to make you nervous if you don’t know it well, and it takes a significant investment of time to know it. It has only 25 drives, but after a few replacements and a host failure, there has been only one data-loss event.
Gluster has been smooth sailing. It’s simpler and easier to work with. You have to enable features such as checksums, but ultimately it runs without complaint and more reliably. No issues - it even ran with a downed host for several months. Worst case, you still have the data natively on disk, unlike the object-based storage of the others. Shame that Red Hat/IBM has shifted focus, but market forces, I guess.
Note: I don’t actually remember the shark ever, ever needing anything. But that was back when there were mainframes and ties everywhere. I don’t know why we had to interrupt services exactly - I think our storage team was ‘following vendor instructions’.