Distributed filesystem

Introduction

Distributed filesystems are a way to distribute and/or replicate data over multiple hosts, connected via a network.

There are a lot of distributed filesystems out there, with various designs:

  • kernelspace or userspace
  • designed for LAN or WAN
  • ability to accommodate possibly volatile nodes (participants on regular hardware over a WAN)
  • fully P2P or more centralized (e.g. a single server storing all metadata)
  • run on trusted hosts only, or more open
  • always-connected or offline mode (with merge features)
  • use crypto to validate transactions
  • use crypto to communicate

The goals also vary a lot from one filesystem to another:

  • high performance
  • high availability (some nodes may fail without shutting down the whole filesystem)
  • redundancy of data
  • aggregation of disk storage (scalable to petabyte sizes)
  • data consistency

Running a distributed filesystem on a darknet

We want to run a distributed filesystem over secure links such as OpenVPN or Tinc, on a darknet.

High-level goal

The goal is to provide an easy way to share and distribute data among participants, so that anybody in the network can write data, which will then be distributed across all hosts. More importantly, the data should remain available even if some (or most) nodes fail.

This would provide a simple interface to exchange data between participants. Because building darknets is cool, but using them is better :)

Note that a network such as Freenet almost does that. There are a few differences, however:

  • anybody can join Freenet, while our system will run on top of a darknet (which requires making a secure link to at least one existing node to join the network). Trust issues are thus less important.
  • consequently, we do not adopt the « each user has their own data, which is replicated over the network » model, but rather « data isn't owned by anybody in particular, everything is common and accessible read/write ».

Requirements

More specifically, the following requirements should be met:

  • simple to set up (in userspace if possible)
  • not necessarily encrypted, since we run it over secure links
  • being able to scale on low-uplink hosts (e.g. regular ADSL lines). Basically, if writing on a host means pushing the data to every other host yourself, it clearly won't scale. Something along the lines of BitTorrent, where peers that have already received some data forward it to other peers, would be great (see the sketch just after this list).
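
To make the BitTorrent-like idea more concrete, here is a minimal Python sketch of chunk flooding, using an in-memory "network" and made-up names instead of real sockets and real peers: a peer that receives a new chunk stores it and forwards it to its direct neighbours, so the writer only uploads each chunk once rather than once per participant.

  # Minimal in-memory sketch of chunk flooding between peers.
  # Everything here is illustrative; a real implementation would talk
  # over the darknet links instead of calling methods directly.

  class Peer:
      def __init__(self, name):
          self.name = name
          self.neighbours = []       # direct links (e.g. Tinc connections)
          self.chunks = {}           # chunk_id -> data

      def receive(self, chunk_id, data):
          if chunk_id in self.chunks:
              return                 # already seen, stop the flood here
          self.chunks[chunk_id] = data
          for n in self.neighbours:  # forward to our direct neighbours
              n.receive(chunk_id, data)

  # A small non-fully-connected topology: a -- b -- c
  a, b, c = Peer("a"), Peer("b"), Peer("c")
  a.neighbours = [b]
  b.neighbours = [a, c]
  c.neighbours = [b]

  # "a" writes one chunk; it reaches "c" through "b", so "a" never
  # has to push the data to every participant itself.
  a.receive("chunk-0", b"some data")
  print(sorted(c.chunks))            # ['chunk-0']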

Note that a distributed filesystem is not the only solution. We could also simply use a daemon that watches a given directory and sends/receives file modifications to/from the other hosts.
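
As a rough sketch of that alternative, assuming the third-party Python watchdog library and an rsync binary are available (the peer names and the shared directory below are only placeholders), such a daemon could look like this:

  # Sketch of a "watch a directory and push changes" daemon.
  # Assumes the watchdog library and rsync; PEERS and SHARED_DIR
  # are placeholders for this example.
  import subprocess
  from watchdog.observers import Observer
  from watchdog.events import FileSystemEventHandler

  SHARED_DIR = "/srv/shared"
  PEERS = ["peer1.darknet", "peer2.darknet"]   # reachable over the secure links

  class PushHandler(FileSystemEventHandler):
      def on_any_event(self, event):
          # Something changed: resynchronize the whole tree with every peer
          # (rsync only transfers the differences).
          for peer in PEERS:
              subprocess.call(["rsync", "-az", "--delete",
                               SHARED_DIR + "/", peer + ":" + SHARED_DIR + "/"])

  observer = Observer()
  observer.schedule(PushHandler(), SHARED_DIR, recursive=True)
  observer.start()
  observer.join()

Note that this naive version pushes directly to every peer, which is exactly the uplink problem mentioned above; a real daemon would also need some relaying scheme, and a way to receive (and merge) changes from the other hosts.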

Comparison of existing distributed filesystems

Glusterfs

Glusterfs is a distributed filesystem that runs on FUSE. It aims to be reliable and high-performance (its authors advertise it as a filesystem suitable for a « cloud » infrastructure), while staying quite modular.

Glusterfs can thus run in a number of modes, including:

  • RAID0-like, where storage space of all nodes is aggregated to form a larger partition. When writing, files are sent to a random node.
  • RAID1-like, where data is replicated over all storage nodes in real-time.
  • striping mode, which seems to basically be « split each file into stripes, and distribute these stripes across storage nodes ». FIXME: does it allow a kind of erasure coding?
  • potentially other modes…

Ideally, we want to use erasure coding techniques and/or a RAID5-like mode, to ensure that some nodes can be down without compromising the availability of the filesystem, but also to have more storage space available. It might however be a lot simpler to use a RAID1-like (fully replicated) mode, with the possible drawbacks mentioned below.
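
To make the erasure-coding idea concrete, here is a toy Python sketch of the simplest scheme, single-parity XOR (what RAID5 uses): N data chunks plus one parity chunk, each stored on a different node, and any single missing chunk can be rebuilt from the others. This only illustrates the principle, not what glusterfs actually provides (see the FIXME above); surviving more than one failure would require a Reed-Solomon-style code.

  # Toy single-parity (RAID5-like) erasure coding: split data into
  # chunks, add one XOR parity chunk, and rebuild any one lost chunk.

  def xor(a, b):
      return bytes(x ^ y for x, y in zip(a, b))

  def encode(data, n_chunks):
      size = -(-len(data) // n_chunks)               # ceiling division
      data = data.ljust(size * n_chunks, b"\0")      # pad to a multiple of the chunk size
      chunks = [data[i*size:(i+1)*size] for i in range(n_chunks)]
      parity = chunks[0]
      for c in chunks[1:]:
          parity = xor(parity, c)
      return chunks, parity                          # store each piece on a different node

  def recover(chunks, parity, lost_index):
      # XORing the parity with all surviving chunks gives back the lost one.
      rebuilt = parity
      for i, c in enumerate(chunks):
          if i != lost_index:
              rebuilt = xor(rebuilt, c)
      return rebuilt

  chunks, parity = encode(b"hello, darknet filesystem!", 3)
  assert recover(chunks, parity, 1) == chunks[1]     # node 1 is down, data still recoverable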

In any case, each participant runs its own glusterd instance.

Resources

Pros

  • truly decentralized, since all storage nodes are equivalent:
    • there is no dedicated metadata server (unlike most distributed filesystems, such as MooseFS, Lustre, …)
    • the filesystem can be mounted from any storage node
  • in RAID1-like (replicated) mode: reading performance is optimal, since each node has a complete local copy of the data

Cons

  • in RAID1-like (replicated) mode: writing can be very slow, since data must be sent to all other nodes. This can be mitigated if glusterfs allows a non-fully-connected topology (i.e. you don't make glusterfs links with everybody, but only with two or three other participants). It could be interesting to make these glusterfs links only between participants that are directly connected through Tinc, to avoid bandwidth bottlenecks.
  • configuration is a bit painful, especially regarding IPv6 (we still have socat if we really need to…)

Ceph

Ceph is a distributed storage system built around an object store (RADOS): file data is spread as objects across the storage nodes, dedicated metadata servers handle the filesystem namespace, and cluster monitors keep track of the cluster state. A POSIX filesystem (CephFS) can be mounted on top of it.

Resources

Pros

  • high-performance
  • stable and reliable
  • in-kernel support in Linux since 2.6.34
  • octopuses are cool :)

Cons

  • hard to set up (three kinds of nodes, etc.)
  • a bit overkill for a simple setup like this (only a few nodes over a WAN)
  • maybe more designed to run on a LAN than a WAN?
  • still based on a client/server model: to achieve full decentralization, we would need a cluster monitor, a metadata server and an object storage device on each participant's machine.

To investigate

  • Tahoe-LAFS
  • PVFS2
  • RAID-like setups (DRBD, RAID + NBD/iSCSI)
  • home-made (e.g. with Unison)
  • Ceph
  • Lustre
  • Ivy