Rethinking ‘dump’
Bear Giles | May 18, 2011
Backup programs are interesting critters. You need one that backs up what you do but you also end up only doing things that you can back up. The most commonly used formats, tar and zip, are great for what they do but there are some serious limitations for a modern system.
- Sparse File – a sparse file allows you to create a file that seems to be quite large but that only actually consumes disk space when you write data to it. A good modern use is creating a disk image – you can create a large sparse file, put a filesystem on it, and only require a bit more disk space than the files you put in it. The file will expand to its nominal size when you burn it to a CD or DVD, but you want your backups to be more space-efficient than that.
- Extended Attributes – boolean per-file attributes beyond the standard Unix discretionary attributes (e.g., read/write/execute). Some examples are:
  - immutable – nobody can modify the file, not even root.
  - append-only – the file cannot be modified other than by appending data to it. You’ll often see this attribute on log files.
  - secure deletion – the disk sectors are zeroed out when the file is deleted.
  - do not back up – backup programs should skip this file. You’ll often see this attribute on private key files.
- Access Control Lists – extended attributes that provide finer access control than the standard Unix controls. E.g., you can say that a file is read-only except for user ids 1003, 1073 and 1083, who can also write to it, and user 1077, who can’t read the file at all.
- Mandatory Access Control Labels (SELinux) – labels that tie a file to a system-wide security policy that ordinary users cannot override. E.g., you can label a file to say that it’s used by the web server and all of the policies associated with the web server should be applied to it.
Many if not all of these attributes are being standardized but there’s no support for them in the (standard) tar and zip formats. We don’t need this functionality on the typical home system but it can be critical on servers.
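To make the sparse-file point concrete, here is a minimal Python sketch that creates a sparse file and compares its apparent size with the space actually allocated on disk. The helper names (`make_sparse_file`, `disk_usage`) are made up for this illustration, and the sketch assumes a Unix-like system whose filesystem supports holes; on other filesystems the allocated size may simply equal the apparent size.

```python
import os
import tempfile


def make_sparse_file(path, apparent_size):
    """Create a file of apparent_size bytes without writing any data."""
    with open(path, "wb") as f:
        # Extending the file with truncate() writes no data blocks; a
        # filesystem that supports sparse files records a hole instead.
        f.truncate(apparent_size)


def disk_usage(path):
    """Return (apparent size, bytes actually allocated) for path."""
    st = os.stat(path)
    # st_blocks is reported in 512-byte units on Linux.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "disk.img")
        make_sparse_file(image, 100 * 1024 * 1024)
        apparent, allocated = disk_usage(image)
        print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

A backup tool that naively reads such a file from start to end sees the full apparent size, zeros included, which is why the archive format itself needs to know about holes.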
So what’s wrong with ‘dump’ for Ext2/3/4 filesystems?
- No Indexing – it should be possible to quickly determine whether a file is present in an archive and to retrieve it. ‘Dump’ provides limited support with a proprietary data format but it’s not easy to create an index spanning multiple archives.
- No Error Detection – there is no way to determine that an archived file has been corrupted.
- No Encryption – there is no native encryption for the archives. This is important if you write your archive directly to tape or are unable to load the complete archive for decryption before restoring files.
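As an illustration of what error detection could look like, the sketch below records a SHA-256 digest for a file at backup time and checks it again later. This is only one possible approach under my own assumptions, not how ‘dump’ works or will work; the function names `file_digest` and `verify` are hypothetical.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, expected_digest):
    """Return True if the file still matches the digest recorded earlier."""
    return file_digest(path) == expected_digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archived = os.path.join(tmp, "archived.bin")
        with open(archived, "wb") as f:
            f.write(b"payload written at backup time")
        recorded = file_digest(archived)

        # Simulate corruption by overwriting the first byte.
        with open(archived, "r+b") as f:
            f.write(b"X")
        print("intact" if verify(archived, recorded) else "corrupted")
```

A digest like this only detects corruption; it cannot repair it, and it has to be stored somewhere the corruption is unlikely to reach.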
There is one additional issue when performing disk-based backups – a modern kernel will cache a great deal of information and the raw block device may not be fully consistent. LVM-based partitions will help tremendously – we can sync(1) the filesystem and immediately create a snapshot. Applications may still have unwritten caches but we can’t do anything about that without taking the system down to a quiet state and unmounting the partition.
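The sync-then-snapshot step can be sketched in Python as follows. The volume group, logical volume and snapshot names are placeholders, the snapshot size is an arbitrary assumption, and the function defaults to a dry run that only returns the `lvcreate` command it would execute. Actually running it requires root and an LVM-backed volume.

```python
import os
import subprocess


def snapshot_command(volume_group, logical_volume, snapshot_name, size="1G"):
    """Build the lvcreate command that snapshots an existing logical volume."""
    origin = f"/dev/{volume_group}/{logical_volume}"
    # --size is the copy-on-write space reserved for the snapshot,
    # not the size of the origin volume.
    return ["lvcreate", "--snapshot", "--size", size,
            "--name", snapshot_name, origin]


def create_snapshot(volume_group, logical_volume, snapshot_name,
                    size="1G", dry_run=True):
    """Flush kernel buffers, then snapshot the volume (unless dry_run)."""
    cmd = snapshot_command(volume_group, logical_volume, snapshot_name, size)
    if not dry_run:
        os.sync()                        # ask the kernel to write dirty buffers
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Placeholder names; prints the command without touching the system.
    print(" ".join(create_snapshot("vg0", "home", "home-snap")))
```

The backup would then read from the snapshot device rather than the live volume, and the snapshot should be removed once the backup completes.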
Important: do not attempt to create file-based backups of running databases! Use the backup program provided by the database if you need to back up a running database.
So why doesn’t anyone do something about this? Funny you should ask…. In fact I’ve started working on this and hope to submit a patch to the maintainers soon. The first patch will provide a SQLite index to the archive in addition to the existing format. The second patch will provide error detection.
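To give a feel for what a SQLite index spanning multiple archives might look like, here is a small Python sketch. The schema, table names and helper functions are invented for this example and are not the format used in the patch; the point is only that a single query can answer "which archives contain this file?".

```python
import sqlite3

# Hypothetical schema: one row per archive, one row per file in each archive.
SCHEMA = """
CREATE TABLE IF NOT EXISTS archive (
    id    INTEGER PRIMARY KEY,
    label TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS entry (
    archive_id INTEGER NOT NULL REFERENCES archive(id),
    path       TEXT NOT NULL,
    inode      INTEGER NOT NULL,
    size       INTEGER NOT NULL,
    PRIMARY KEY (archive_id, path)
);
CREATE INDEX IF NOT EXISTS entry_path ON entry(path);
"""


def open_index(db_path=":memory:"):
    """Open (or create) the index database and make sure the schema exists."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def add_archive(conn, label, entries):
    """Record an archive and its (path, inode, size) entries."""
    cur = conn.execute("INSERT INTO archive(label) VALUES (?)", (label,))
    archive_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO entry(archive_id, path, inode, size) VALUES (?, ?, ?, ?)",
        [(archive_id, path, inode, size) for path, inode, size in entries],
    )
    conn.commit()
    return archive_id


def find(conn, path):
    """Return the labels of all archives containing path, oldest first."""
    rows = conn.execute(
        "SELECT a.label FROM entry e JOIN archive a ON a.id = e.archive_id "
        "WHERE e.path = ? ORDER BY a.id",
        (path,),
    )
    return [label for (label,) in rows]


if __name__ == "__main__":
    conn = open_index()
    add_archive(conn, "level0", [("/etc/passwd", 12, 1843)])
    add_archive(conn, "level1", [("/etc/passwd", 12, 1901), ("/etc/motd", 40, 286)])
    print(find(conn, "/etc/passwd"))
```

Keeping the index in a separate database file means it can be queried without loading any archive from tape or disk.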
Encryption support is much more difficult. It’s easy to do something, but it’s also easy to screw up and have a much weaker system than you realized.