Wednesday, January 20. 2010
- Avoid traditional system designs
- single server bottlenecks
- Avoid symmetric shared disk (cumbersome, hard to scale)
Ceph is one of those things that fills one with the kind of “gimmegimmegimme” enthusiasm.
Key Design Point
Segregate Data and Metadata (Object based storage)
- Store objects rather than blocks. Alphanumeric name, data blob, named attributes.
- Objects are in a flat name space in object pools.
- Cluster of servers store all data objects.
- Separate metadata cluster - security etc.
- The metadata server stays out of the IO path to avoid becoming a bottleneck.
- Metadata is not needed to find objects; the location is encoded in the name.
- Rather than using allocation tables, objects are found by hash functions; the name is the location. This scales trivially for performance, but creates problems when you add devices.
- Ceph uses an algorithm called CRUSH to stabilise mapping.
- Not only location but policy is encoded, such as the number of replicas and the locational constraints.
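To see why plain hashing "creates problems when you add devices", here is a minimal sketch comparing naive modulo placement with a stable scheme. The stable scheme used here is rendezvous (highest-random-weight) hashing, which is a stand-in chosen for brevity and is not the CRUSH algorithm itself; the device names, object names, and the helper `h` are all made up for illustration.

```python
import hashlib

def h(*parts):
    """Stable 64-bit hash of some strings (illustrative, not Ceph's hash)."""
    digest = hashlib.sha256("/".join(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place_modulo(name, devices):
    # Naive placement: the hash picks an index into the device list.
    return devices[h(name) % len(devices)]

def place_rendezvous(name, devices):
    # Stable placement: every device gets a pseudo-random score for this
    # object, and the highest score wins. Adding a device only moves the
    # objects for which the new device happens to score highest.
    return max(devices, key=lambda device: h(name, device))

def moved_fraction(place, names, before, after):
    """Fraction of objects whose device changes between two device lists."""
    moved = sum(place(n, before) != place(n, after) for n in names)
    return moved / len(names)

names = [f"obj{i}" for i in range(10000)]   # hypothetical object names
before = [f"osd{i}" for i in range(10)]     # hypothetical devices
after = before + ["osd10"]                  # one device added

modulo_moved = moved_fraction(place_modulo, names, before, after)
rendezvous_moved = moved_fraction(place_rendezvous, names, before, after)
print(f"modulo: {modulo_moved:.1%} moved, rendezvous: {rendezvous_moved:.1%} moved")
```

Going from 10 to 11 devices, modulo placement is expected to relocate about 10/11 of the objects, while the stable scheme relocates about 1/11, and only onto the new device.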
Files are broken into 4MB objects and mapped out to placement groups, and then hashed with CRUSH. Pseudo-random, statistically uniform distribution.
Fast, reliable (span failure domains), stable (adding or removing devices causes minimal data movement).
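The pipeline described here (file, then 4MB objects, then placement groups, then devices) can be sketched roughly as follows. Everything specific is an assumption for illustration: the object naming scheme, the placement group count, the cluster map, and the rack-aware selection, which is a simple ranked pick rather than real CRUSH.

```python
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024   # 4MB objects, as in the talk
PG_COUNT = 64                   # hypothetical number of placement groups

# Hypothetical cluster map: device -> rack (the failure domain).
CLUSTER = {
    "osd0": "rack-a", "osd1": "rack-a",
    "osd2": "rack-b", "osd3": "rack-b",
    "osd4": "rack-c", "osd5": "rack-c",
}

def h(*parts):
    """Stable 64-bit hash of some strings (illustrative, not Ceph's hash)."""
    digest = hashlib.sha256("/".join(parts).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def object_names(inode, file_size):
    """Split a file into fixed-size objects named '<inode>.<index>' (assumed scheme)."""
    count = max(1, -(-file_size // OBJECT_SIZE))  # ceiling division
    return [f"{inode:x}.{index:08x}" for index in range(count)]

def placement_group(object_name):
    """Hash an object name into one of PG_COUNT placement groups."""
    return h(object_name) % PG_COUNT

def devices_for_pg(pg, replicas=2):
    """Pick `replicas` devices for a placement group, one per rack at most."""
    ranked = sorted(CLUSTER, key=lambda device: h(str(pg), device), reverse=True)
    chosen, used_racks = [], set()
    for device in ranked:
        if CLUSTER[device] not in used_racks:
            chosen.append(device)
            used_racks.add(CLUSTER[device])
        if len(chosen) == replicas:
            break
    return chosen

def locate(inode, file_size, replicas=2):
    """Map every object of a file to the devices holding its replicas."""
    return {
        name: devices_for_pg(placement_group(name), replicas)
        for name in object_names(inode, file_size)
    }

for name, devices in locate(inode=0x1234, file_size=10 * 1024 * 1024).items():
    print(name, "->", devices)
```

No lookup table is consulted at any point: any client holding the same cluster map computes the same answer, and the replica count and rack constraint travel with the placement function as policy.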
Reliable Distributed Storage
POSIX filesystems
- Create filesystem hierarchies via the cmds daemons, storing metadata remotely. Memory hungry.
- They’re dynamic, load balancing clusters.
- Ceph embeds inodes with filenames and filedata. Hardlinks are provided for with separate data.
- Metadata servers maintain journals. Journals are allowed to grow very large in order to run more efficiently in terms of physical IOs.
- Directory trees are forced to span storage servers to improve parallelism within a POSIX filesystem. This also helps with avoiding performance hits on node failure.
- Recursive accounting: each directory tracks usage for its whole subtree, so directory sizes are reported correctly. No more du -ks | sort -n
- Extended attrs allow for location exposure, redundant copies and so on.
- Snapshots: fine-grained, snapshot individual directories, via a hidden directory. Leverages copy-on-write.
- Kernel driver, userspace client via FUSE as well as libceph and client modules for libceph.
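The recursive accounting idea from the list above can be sketched as a toy in-memory tree in which each directory carries a running total for its whole subtree. The `Directory` class and its method names are hypothetical, and the eager walk up to the root is a simplification; the talk notes do not say how Ceph propagates these totals.

```python
class Directory:
    """Toy directory that keeps a running byte total for its whole subtree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.files = {}           # file name -> size in bytes
        self.recursive_bytes = 0  # total size of everything below this directory

    def mkdir(self, name):
        """Create an empty child directory."""
        return Directory(name, parent=self)

    def write_file(self, name, size):
        """Create or resize a file and push the size change up to the root."""
        delta = size - self.files.get(name, 0)
        self.files[name] = size
        node = self
        while node is not None:
            node.recursive_bytes += delta
            node = node.parent

root = Directory("/")
home = root.mkdir("home")
alice = home.mkdir("alice")

alice.write_file("video.ogv", 100)
home.write_file("notes.txt", 50)
alice.write_file("video.ogv", 30)   # file shrinks; totals follow

# Subtree usage is a stored number, so no tree walk is needed to read it.
print(root.recursive_bytes, home.recursive_bytes, alice.recursive_bytes)
```

The trade-off is a little extra work on every write in exchange for answering "how big is this directory?" without scanning it.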
State and future
- Stable, not production-ready.
- Needs more users before full merge.
- Kerberos like security.
- Testing, testing, testing.
- Linux block device.
- KVM/QEMU storage driver at the VM level.
- Alternative replication, with parallel links across datacentres.
Q&A
- Filesystem stores are in pools, you can run multiple pools.
- Not currently suited for broad distribution. Consistency may currently be sacrificed, with manual fixes sometimes needed if nodes drop out and come back.
- Best suited to huge amounts of data like video.
- OSDs are tightly coupled to btrfs. May go away as a dependency.
- Deduping is not currently supported. btrfs can also help with the problem.
- Full POSIX is desired.
- Upcoming beta testing with a lab cluster, dreamhosting.com (sign up with ceph for a year).
- Rebalancing happens in proportion to new devices being added, when they’re added.
- CRUSH rules could allow you to do multi-tier storage.