Marc's Public Blog

2014/03/19 Btrfs Tips: Btrfs Scrub and Btrfs Filesystem Repair

π 2014-03-19 01:01 in Btrfs, Linux

You mostly don't need fsck on btrfs

Btrfs does not have a really useful fsck, which many people think is horrible and makes btrfs unsafe. In real life, there are many ways to mount btrfs filesystems with problems, or extract data from them if they are too corrupted to mount.

Because of how btrfs is laid out, it is not susceptible to the corruption or out-of-sync-between-locations issues that ext2/3/4 are. In normal operation you need fsck on ext2/3/4 to clean up a few things from time to time, but this is just not needed with btrfs. Now, for very unexpected things and bad corruption, obviously you have backups, right? But if you'd like to know what was corrupted and whether a restore from backup is needed, or you want to recover new data written since your last backup, read on.

Run online fsck nightly or weekly, and look for errors that btrfs is noticing and reporting in syslog:

btrfs scrub is actually an online fsck everyone should run. It cannot fix anything unless you have raid1 (raid5/6 support for scrubbing and fixing hasn't been added yet as of kernel 3.14). Note that because data is checksummed, it will find data corruption, but as a result it takes longer to run since it needs to check all your data blocks too. This does put load on your filesystem and machine. With my scrubbing script below, you'll even get a list of files that are corrupted, if any.

use sec.pl (apt-get install sec on debian or http://simple-evcorr.sourceforge.net/ ) to report warnings or errors reported in syslog. See the example config file below.

Mounting filesystems with problems, try in order:

mount -o recovery,ro (for when regular mount isn't working). Note, you have to use ro or it will give you a misleading error:

root@polgara:~# mount -o recovery /dev/mapper/crypt /mnt/mnt8
mount: /dev/mapper/crypt already mounted or /mnt/mnt8 busy
root@polgara:~# mount -o recovery,ro /dev/mapper/crypt /mnt/mnt8
root@polgara:~#

mount -o degraded (for raid5/6 with a missing drive)

btrfs-zero-log can help mount a filesystem if the last blocks btrfs wrote before a crash were corrupted or out of order due to bad hardware or other bugs. See https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_can.27t_mount_my_filesystem.2C_and_I_get_a_kernel_oops.21 In debian, I made sure this was part of the initramfs so that you can fix this problem and mount your root filesystem if you had an unclean shutdown and bad hardware that messed up the last blocks that were written.
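
The "try in order" list above can be scripted. This is only a minimal sketch: try_mount and the MOUNT variable are illustrative names (not btrfs tools), and MOUNT exists so a stub can be substituted for dry runs. The mount options are the ones listed above.

```shell
#!/bin/sh
# Sketch: try progressively more forgiving mount options, in the order
# given above. try_mount and MOUNT are illustrative names, not btrfs tools.
: "${MOUNT:=mount}"

try_mount() {
    # $1 = device, $2 = mountpoint
    for opts in defaults recovery,ro degraded; do
        if $MOUNT -o "$opts" "$1" "$2" 2>/dev/null; then
            echo "mounted $1 on $2 with -o $opts"
            return 0
        fi
    done
    echo "all mount attempts failed for $1; consider btrfs-zero-log or btrfs restore" >&2
    return 1
}
```

Usage would look like: try_mount /dev/mapper/crypt /mnt/mnt8. Errors from mount are hidden here on purpose since, as noted above, they can be misleading; check syslog for the real reason.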

Getting data off a filesystem you can't mount:

btrfs restore. See https://btrfs.wiki.kernel.org/index.php/Restore

btrfs check --repair, aka btrfsck. See http://www.phoronix.com/scan.php?page=news_item&px=MTA2MDI and https://btrfs.wiki.kernel.org/index.php/Btrfsck . btrfsck is known not to do great work, but it could be useful nonetheless. Apparently --init-csum-tree then lets you mount a filesystem with corrupted blocks using -o nodatasum.

How to configure sec, event correlator to report btrfs filesystem errors or warnings

After installing sec.pl (apt-get install sec on debian or http://simple-evcorr.sourceforge.net/ ), install the 2 config files below.

This is not foolproof: it relies on a regex of known messages that are ok, and reports all unknown ones. You can extend the negative lookahead in the regex as needed.


polgara:~# cat /etc/default/sec 
#Defaults for sec
RUN_DAEMON="yes"
DAEMON_ARGS="-conf=/etc/sec.conf -input=/var/log/syslog -pid=/var/run/sec.pid -detach -log=/var/log/sec.log"
polgara:~# cat /etc/sec.conf
# http://simple-evcorr.sourceforge.net/man.html
# http://sixshooter.v6.thrupoint.net/SEC-examples/article.html
# http://sixshooter.v6.thrupoint.net/SEC-examples/article-part2.html

type=SingleWithSuppress
ptype=RegExp
pattern=(?i)kernel.*btrfs: (?!disk space caching is enabled|use ssd allocation|use .* compression|unlinked .* orphans|turning on discard|device label .* devid .* transid|detected SSD devices, enabling SSD mode|has skinny extents|device label|creating UUID tree|checking UUID tree|setting .* feature flag|bdev.* flush 0, corrupt 0, gen 0)
window=60
desc=Btrfs unexpected log
action=pipe '%t: $0' /usr/bin/mail -s "sec: %s" root
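
Before relying on sec, it can help to check the whitelist against sample log lines. The sketch below approximates the rule with two greps (the same approach the scrub script further down uses) instead of a negative lookahead. btrfs_unexpected and OK_MSGS are illustrative names, and the whitelist is shortened for readability.

```shell
#!/bin/sh
# Sketch: approximate the sec rule with plain grep so the whitelist can be
# tested on sample log lines. btrfs_unexpected is an illustrative name.
# First grep: keep kernel btrfs messages. Second grep: drop the known-ok ones.
OK_MSGS='disk space caching is enabled|unlinked .* orphans|turning on discard|device label .* devid .* transid|has skinny extents'

btrfs_unexpected() {
    grep -Ei 'kernel.*btrfs: ' | grep -Eiv "btrfs: ($OK_MSGS)"
}
```

Running btrfs_unexpected < /var/log/syslog should print only the lines that sec would mail to root, assuming the two whitelists are kept in sync.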

Daily or weekly btrfs scrub

btrfs scrub is a must-have with btrfs. Note that it is disk intensive since it checks everything for consistency (including data blocks), so on a server it does add load, and it could take over a day to run if you have terabytes of data.

If you don't have shlock, install inn, copy shlock out of it, and then delete it :) (or you can remove shlock from the script, it's not vital).

The up to date version of this script is at http://marc.merlins.org/linux/scripts/btrfs-scrub


#! /bin/bash

# By Marc MERLIN <marc_soft@merlins.org> 2014/03/20
# License: Apache-2.0

which btrfs >/dev/null || exit 0

export PATH=/usr/local/bin:/sbin:$PATH

# bash shortcut for `basename $0`
PROG=${0##*/}
lock=/var/run/$PROG

# shlock (from inn) does the right thing and grabs a lock for a dead process
# (it checks the PID in the lock file and if it's not there, it
# updates the PID with the value given to -p)
# You can replace this with another lock program if you prefer or even remove
# the lock.
if ! shlock -p $$ -f $lock; then
    echo "$lock held, quitting" >&2
    exit
fi

if which on_ac_power >/dev/null 2>&1; then
    ON_BATTERY=0
    on_ac_power >/dev/null 2>&1 || ON_BATTERY=$?
    if [ "$ON_BATTERY" -eq 1 ]; then
	exit 0
    fi
fi

FILTER='(^Dumping|balancing, usage)'
test -n "$DEVS" || DEVS=$(grep '\<btrfs\>' /proc/mounts | awk '{ print $1 }' | sort -u)
for btrfs in $DEVS
do
    tail -n 0 -f /var/log/syslog | grep "BTRFS: " | grep -Ev '(disk space caching is enabled|unlinked .* orphans|turning on discard|device label .* devid .* transid|enabling SSD mode|BTRFS: has skinny extents|BTRFS: device label)' &
    mountpoint="$(grep "$btrfs" /proc/mounts | awk '{ print $2 }' | sort | head -1)"
    logger -s "Quick Metadata and Data Balance of $mountpoint ($btrfs)" >&2
    # Even in 4.3 kernels, you can still get in places where balance
    # won't work (no place left, until you run a -m0 one first)
    btrfs balance start -musage=0 -v $mountpoint 2>&1 | grep -Ev "$FILTER"
    btrfs balance start -musage=20 -v $mountpoint 2>&1 | grep -Ev "$FILTER"
    # After metadata, let's do data:
    btrfs balance start -dusage=0 -v $mountpoint 2>&1 | grep -Ev "$FILTER"
    btrfs balance start -dusage=20 -v $mountpoint 2>&1 | grep -Ev "$FILTER"
    # And now we do scrub. Note that scrub can fail with "no space left
    # on device" if you're very out of balance.
    logger -s "Starting scrub of $mountpoint" >&2
    echo btrfs scrub start -Bd $mountpoint
    ionice -c 3 nice -10 btrfs scrub start -Bd $mountpoint
    pkill -f 'tail -n 0 -f /var/log/syslog'
    logger "Ended scrub of $mountpoint" >&2
done

rm $lock

2014/03/20 Btrfs Tips: ACPI S3 Sleep aka Suspend And Btrfs Scrub

π 2014-03-20 01:01 in Btrfs, Linux

Btrfs and S3 Sleep (Suspend)

As of kernel 3.14, btrfs doesn't do the right things to freeze and allow a laptop or machine to go to ACPI sleep.

This is discussed in more detail in this thread: http://comments.gmane.org/gmane.comp.file-systems.btrfs/33106

For now, I am using this crude workaround. I added this in /etc/acpi/sleep.sh:
awk '/btrfs/ { print $1 }' /proc/mounts | sort -u | while read fs; do btrfs scrub cancel $fs; done

This could easily be improved by running scrub status, pausing scrubs that are running instead of cancelling them, and resuming them after coming back from sleep.
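
Here is a rough sketch of that improvement. It assumes that btrfs scrub cancel returns a non-zero status when no scrub is running and that btrfs scrub resume picks up from the saved position; verify both against your btrfs-tools version. BTRFS, MOUNTS and STATE are illustrative variables, there so the functions can be dry-run with stubs. Unlike the one-liner above, it matches on the filesystem type field instead of any line containing "btrfs".

```shell
#!/bin/sh
# Sketch of the improvement suggested above: cancel running scrubs before
# sleep, remember which ones were running, and resume them after wakeup.
# BTRFS, MOUNTS and STATE are illustrative variables, not part of btrfs.
: "${BTRFS:=btrfs}"
: "${MOUNTS:=/proc/mounts}"
: "${STATE:=/var/run/btrfs-scrub-suspended}"

# Unique devices backing mounted btrfs filesystems (field 3 is the fs type).
btrfs_devs() {
    awk '$3 == "btrfs" { print $1 }' "$MOUNTS" | sort -u
}

# Call before sleep. Assumption: "scrub cancel" fails when no scrub is
# running, so only devices with an interrupted scrub get recorded.
scrub_suspend() {
    : > "$STATE"
    for dev in $(btrfs_devs); do
        if $BTRFS scrub cancel "$dev" >/dev/null 2>&1; then
            echo "$dev" >> "$STATE"
        fi
    done
}

# Call after wakeup: resume every scrub that scrub_suspend interrupted.
scrub_wake() {
    [ -f "$STATE" ] || return 0
    while read -r dev; do
        $BTRFS scrub resume "$dev" < /dev/null
    done < "$STATE"
    rm -f "$STATE"
}
```

scrub_suspend would go in /etc/acpi/sleep.sh in place of the one-liner, and scrub_wake in whatever hook your system runs on resume.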

2014/03/21 Btrfs Tips: How To Setup Netapp Style Snapshots

π 2014-03-21 01:01 in Btrfs, Linux

How to get Netapp[tm]-like snapshots with BTRFS

Filesystem snapshots are something you'll never want to live without once you've had them. I learned about them in 1997 when I was working for Network Appliance, but unfortunately due to software patents, they managed to prevent most others from enjoying them until more recently.

Linux did have crappy snapshots if you used LVM, but LVM and LVM2 snapshots were both very bad performance-wise: they were not meant to be long lived, and even a single snapshot would slow your filesystem down significantly, never mind multiple levels.

If you can't use btrfs, but you still want historical snapshots, you should look into LVM thin provisioning, which is newer (as of kernel 3.4). It is supposed to be faster for multiple levels of snapshots. Considering how bad LVM2 snapshots are, I'm sure thin snapshots are faster no matter what, but I didn't have a use for them now that I'm using btrfs, so I can't speak to their performance. You can read more here:

https://www.kernel.org/doc/Documentation/device-mapper/thin-provisioning.txt

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/thinly-provisioned_snapshot_volumes.html

Back to btrfs, I use the recommended layout of putting all filesystems in a subvolume. I have this:

/mnt/btrfs_pool1 -> actual btrfs filesystem

/mnt/btrfs_pool1/root -> gets mounted to / with -o subvol=root

/mnt/btrfs_pool1/usr -> gets mounted to /usr with -o subvol=usr

/mnt/btrfs_pool1/var -> gets mounted to /var with -o subvol=var
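
The layout above maps onto /etc/fstab entries along these lines. This is a sketch only: fstab_lines is an illustrative helper name, the device path you pass in is a placeholder, and you will likely want extra mount options (compression, noatime, and so on).

```shell
#!/bin/sh
# Sketch: print fstab entries for the pool plus the root, usr and var
# subvolumes. The device path passed in is a placeholder.
fstab_lines() {
    # $1 = block device holding the btrfs pool
    printf '%s %s btrfs defaults 0 0\n' "$1" /mnt/btrfs_pool1
    for sv in root usr var; do
        case "$sv" in
            root) mp=/ ;;
            *)    mp=/$sv ;;
        esac
        printf '%s %s btrfs subvol=%s 0 0\n' "$1" "$mp" "$sv"
    done
}
```

For example, fstab_lines /dev/mapper/crypt prints four lines you can review before adding them to /etc/fstab.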

After running my script, I get multiple levels of snapshots (I'll show only root here for brevity). With this you can restore older versions of files from 3 hours ago, 3 days ago, or 3 weeks ago. Here is the partial output of /mnt/btrfs_pool1:

drwxr-xr-x 1 root root  370 Feb 24 10:38 root

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_daily_20140316_00:05:01

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_daily_20140318_00:05:01

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_daily_20140319_00:05:01

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_daily_20140320_00:05:00

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_hourly_20140316_22:33:00

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_hourly_20140318_00:05:01

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_hourly_20140319_00:05:01

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_hourly_20140320_00:05:00

drwxr-xr-x 1 root root  336 Feb 19 21:40 root_weekly_20140223_00:06:01

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_weekly_20140302_00:06:01

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_weekly_20140309_00:06:01

drwxr-xr-x 1 root root  370 Feb 24 10:38 root_weekly_20140316_00:06:01

Note that snapshots are not backups: they give you a view into the past only if your filesystem hasn't been corrupted and the disk you were using didn't die.

I then have a cronjob that runs this:

0 * * * * root btrfs-snaps hourly 3 | egrep -v '(Create a snapshot of|Will delete the oldest|Delete subvolume|Making snapshot of )'
2 0 * * * root btrfs-snaps daily  4 | egrep -v '(Create a snapshot of|Will delete the oldest|Delete subvolume|Making snapshot of )'
3 0 * * 0 root btrfs-snaps weekly 4 | egrep -v '(Create a snapshot of|Will delete the oldest|Delete subvolume|Making snapshot of )'

This is using the script, btrfs-snaps, for which I'll paste a most likely outdated copy here:

#!/bin/bash

# By Marc MERLIN <marc_soft@merlins.org>
# License Apache 2.0.

# This lets you create sets of snapshots at any interval (I use hourly,
# daily, and weekly) and delete the older ones automatically.

# Usage:
# This is called from /etc/cron.d like so:
# 0 * * * * root btrfs-snaps hourly 3 | egrep -v '(Create a snapshot of|Will delete the oldest|Delete subvolume|Making snapshot of )'
# 1 0 * * * root btrfs-snaps daily  4 | egrep -v '(Create a snapshot of|Will delete the oldest|Delete subvolume|Making snapshot of )'
# 2 0 * * 0 root btrfs-snaps weekly 4 | egrep -v '(Create a snapshot of|Will delete the oldest|Delete subvolume|Making snapshot of )'

: ${BTRFSROOT:=/mnt/btrfs_pool1}
DATE="$(date '+%Y%m%d_%H:%M:%S')"

type=${1:-hourly}
keep=${2:-3}

cd "$BTRFSROOT"

for i in $(btrfs subvolume list -q . | grep "parent_uuid -" | awk '{print $11}')
do
    # Skip duplicate dirs once a year on DST 1h rewind.
    test -d "$BTRFSROOT/${i}_${type}_$DATE" && continue
    echo "Making snapshot of $type"
    btrfs subvolume snapshot "$BTRFSROOT"/$i "$BTRFSROOT/${i}_${type}_$DATE"
    count="$(ls -d ${i}_${type}_* | wc -l)"
    clip=$(( $count - $keep ))
    if [ $clip -gt 0 ]; then
	echo "Will delete the oldest $clip snapshots for $type"
	for sub in $(ls -d ${i}_${type}_* | head -n $clip)
	do
	    #echo "Will delete $sub"
	    btrfs subvolume delete "$sub"
	done
    fi
done
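
The rotation in the script works because the timestamp suffix (YYYYMMDD_HH:MM:SS) sorts chronologically, so the first entries of a sorted listing are the oldest. The sketch below isolates that step as a filter you can test without touching any subvolume; snapshots_to_delete is an illustrative name, not part of the script.

```shell
#!/bin/sh
# Sketch: print the snapshot names that a "keep N" rotation would delete.
# Names are read from stdin, one per line; because the timestamp suffix is
# YYYYMMDD_HH:MM:SS, a plain sort puts the oldest first.
snapshots_to_delete() {
    # $1 = number of snapshots to keep
    sort | awk -v keep="$1" '
        { name[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print name[i] }'
}
```

For example, ls -d root_daily_* | snapshots_to_delete 4 lists what the daily cron job would remove next, without deleting anything.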

2014/03/22 Btrfs Tips: Doing Fast Incremental Backups With Btrfs Send and Receive

π 2014-03-22 01:01 in Btrfs, Linux

Doing much faster incremental backups than rsync with btrfs send and btrfs receive

If you are doing backups with rsync, you know that on big filesystems, it takes a long time for rsync to scan all the files on each side before it can finally sync them. You also know that rsync does not track file renames (unless you use --fuzzy and the file is in the same directory, and --fuzzy can be very expensive if you have directories with many files; I had it blow through my Comcast bandwidth allowance when I was rsyncing maildir backups).

Just like ZFS, btrfs can compute a list of block changes between 2 snapshots and only send those blocks to the other side, making the backups much much faster.
At the time I'm writing this, it does work, but there are still a few bugs that could cause it to abort (no data loss, but it will stop syncing unless you start over from scratch). Most of those bugs have been fixed in kernel 3.14, so it is recommended you use that kernel or newer unless you're just trying it out for testing.

How does it work?

This is all based on subvolumes, so please put all your data in subvolumes (even your root filesystem).

you make a read only snapshot at the source (let's say in /mnt/btrfs_pool1, you snapshot root to root_ro_timestamp)

you do one btrfs send/receive that sends that entire snapshot to the other side

the following times you tell btrfs send to send the diff between that last read only snapshot and a new one you just made.

on the other side, you only run btrfs receive in a btrfs block pool (let's say /mnt/btrfs_pool2). You do not give it any arguments linked to the backup name, because it keeps track of the snapshot names from what was sent at the source.
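
Done by hand, the steps above look roughly like the sketch below. With DRY_RUN set, the commands are printed to stderr instead of being executed. run, backup_init and backup_incr are illustrative names; the btrfs invocations themselves mirror the ones used in the script further down.

```shell
#!/bin/sh
# Sketch of the manual steps above. With DRY_RUN set, commands are printed
# to stderr instead of executed. run, backup_init and backup_incr are
# illustrative names.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "+ $*" >&2; else "$@"; fi
}

# First run: full copy of a new read-only snapshot.
# $1 = subvolume, $2 = new snapshot name, $3 = destination pool
backup_init() {
    run btrfs subvolume snapshot -r "$1" "$2"
    run sync
    run btrfs send "$2" | run btrfs receive "$3/"
}

# Later runs: only send the difference against the previous snapshot.
# $1 = subvolume, $2 = previous snapshot, $3 = new snapshot, $4 = destination pool
backup_incr() {
    run btrfs subvolume snapshot -r "$1" "$3"
    run sync
    run btrfs send -p "$2" "$3" | run btrfs receive "$4/"
}
```

Run from /mnt/btrfs_pool1, something like DRY_RUN=1 backup_incr root root_ro.old root_ro.new /mnt/btrfs_pool2 shows the commands without touching the filesystem (the two snapshot names here are placeholders).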

If you'd like many more details, you can find some here:

http://lwn.net/Articles/506244

https://btrfs.wiki.kernel.org/index.php/Design_notes_on_Send/Receive

https://btrfs.wiki.kernel.org/index.php/Incremental_Backup

In real life, this is tedious to do by hand, and even the script to write is not super obvious, so I wrote one that I'm sharing here. I actually do a fair amount of backups on the same machine (I back up the SSD on my laptop to a hard drive on the same laptop every hour, because SSDs fail, and they could fail while I am away from home without my regular off-laptop backups), but the script does allow sending the backup to another machine (--dest).

This backup script does a bit more in the following ways:

As per my post on hourly/daily/weekly snapshots, I like snapshots, so I am using this backup script's snapshots as local data recovery snapshots too, and therefore keep some amount of them behind, not just the last one (see -k num).

On my laptop, I want the destination snapshot to be writable and I want to know automatically which snapshot is the latest, so the script creates snapshot_last and snapshot_last_rw symlinks. Using them, I can boot my system from those snapshots and use the system normally if my main boot SSD dies and I need to boot from the HD. Thankfully btrfs supports using -o subvol=root_last_rw as a subvolume name and will follow the symlink to the real volume: root_rw.20140321_07:00:35

At the same time as creating the extra _ro and _rw snapshots for time based recovery, it automatically rotates them out and deletes the oldest ones (--keep says how many to keep).

As another option, Ruedi Steinmann wrote a fancier tool, btrbck. It's more complicated since it's much bigger and written in Java, but it has more features, so you may prefer that.

Here is a link to the latest version of my btrfs-subvolume-backup script and a paste of a potentially outdated version for you to look at:

#!/bin/bash

# By Marc MERLIN <marc_soft@merlins.org>
# License: Apache-2.0

# Source: http://marc.merlins.org/linux/scripts

# $Id: btrfs-subvolume-backup 1012 2014-06-25 21:56:54Z svnuser $
#
# Documentation and details at
# http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Doing-Fast-Incremental-Backups-With-Btrfs-Send-and-Receive

# cron jobs might not have /sbin in their path.
export PATH="$PATH:/sbin"

set -o nounset
set -o errexit
set -o pipefail

# From https://btrfs.wiki.kernel.org/index.php/Incremental_Backup

# bash shortcut for `basename $0`
PROG=${0##*/}
lock=/var/run/$PROG

usage() {
    cat <<EOF
Usage: 
cd /mnt/source_btrfs_pool
$PROG [--init] [--keep|-k num] [--dest hostname] volume_name /mnt/backup_btrfs_pool

Options:
    --init:          First run: send a full snapshot instead of a diff (no previous snapshot needed).
    --keep num:      Keep the last snapshots for local backups (5 by default)
    --dest hostname: If present, ssh to that machine to make the copy.
    --diff:	     show an approximate diff between the snapshots

This will snapshot volume_name in a btrfs pool, and send the diff
between it and the previous snapshot (volume_name.last) to another btrfs
pool (on other drives)

If your backup destination is another machine, you'll need to add a few
ssh commands to this script

The num snapshots to keep is to give snapshots you can recover data from 
and they get deleted after num runs. Set to 0 to disable (one snapshot will
be kept since it's required for the next diff to be computed).
EOF
    exit 0
}

die () {
    msg=${1:-}
    # don't loop on ERR
    trap '' ERR

    rm $lock

    echo "$msg" >&2
    echo >&2

    # This is a fancy shell core dumper
    if echo $msg | grep -q 'Error line .* with status'; then
	line=`echo $msg | sed 's/.*Error line \(.*\) with status.*/\1/'`
	echo " DIE: Code dump:" >&2
	nl -ba $0 | grep -3 "\b$line\b" >&2
    fi

    exit 1
}

# Trap errors for logging before we die (so that they can be picked up
# by the log checker)
trap 'die "Error line $LINENO with status $?"' ERR

init=""
# Keep the last 5 snapshots by default
keep=5
TEMP=$(getopt --longoptions help,usage,init,keep:,dest:,prefix:,diff -o h,k:,d:,p: -- "$@") || usage
dest=localhost
ssh=""
pf=""
diff=""

# getopt quotes arguments with ' We use eval to get rid of that
eval set -- $TEMP

while :
do
    case "$1" in
        -h|--help|--usage)
            usage
            shift
            ;;

	--prefix|-p)
	    shift
	    pf=_$1
	    lock="$lock.$pf"
	    shift
	    ;;

	--keep|-k)
	    shift
	    keep=$1
	    shift
	    ;;

	--dest|-d)
	    shift
	    dest=$1
	    ssh="ssh $dest"
	    shift
	    ;;

	--init)
	    init=1
	    shift
	    ;;

	--diff)
	    diff=1
	    shift
	    ;;

	--)
	    shift
	    break
	    ;;

        *) 
	    echo "Internal error from getopt!"
	    exit 1
	    ;;
    esac
done
[[ $keep -lt 1 ]] && die "Must keep at least one snapshot for things to work ($keep given)"

DATE="$(date '+%Y%m%d_%H:%M:%S')"

[[ $# != 2 ]] && usage
vol="$1"
dest_pool="$2"

# shlock (from inn) does the right thing and grabs a lock for a dead process
# (it checks the PID in the lock file and if it's not there, it
# updates the PID with the value given to -p)
if ! shlock -p $$ -f $lock; then
    echo "$lock held for $PROG, quitting" >&2
    exit
fi

if [[ -z "$init" ]]; then
    test -e "${vol}${pf}_last" 	|| die "Cannot sync $vol, ${vol}${pf}_last missing. Try --init?"
    src_snap="$(readlink -e ${vol}${pf}_last)"
fi
src_newsnap="${vol}${pf}_ro.$DATE"
src_newsnaprw="${vol}${pf}_rw.$DATE"

$ssh test -d "$dest_pool/" || die "ABORT: $dest_pool not a directory (on $dest)"

btrfs subvolume snapshot -r "$vol" "$src_newsnap"

if [[ -n "$diff" ]]; then
    echo diff between "$src_snap" "$src_newsnap"
    btrfs-diff "$src_snap" "$src_newsnap"
fi

# There is currently an issue that the snapshots to be used with "btrfs send"
# must be physically on the disk, or you may receive a "stale NFS file handle"
# error. This is accomplished by "sync" after the snapshot
sync

if [[ -n "$init" ]]; then
    btrfs send "$src_newsnap" | $ssh btrfs receive "$dest_pool/"
else
    btrfs send -p "$src_snap" "$src_newsnap" | $ssh btrfs receive "$dest_pool/"
fi

# We make a read-write snapshot in case you want to use it for a chroot
# and some testing with a writeable filesystem or want to boot from a
# last good known snapshot.
btrfs subvolume snapshot "$src_newsnap" "$src_newsnaprw"
$ssh btrfs subvolume snapshot "$dest_pool/$src_newsnap" "$dest_pool/$src_newsnaprw"

# Keep track of the last snapshot to send a diff against.
ln -snf $src_newsnap ${vol}${pf}_last
# The rw version can be used for mounting with subvol=vol_last_rw
ln -snf $src_newsnaprw ${vol}${pf}_last_rw
$ssh ln -snf $src_newsnaprw $dest_pool/${vol}${pf}_last_rw

# How many snapshots to keep on the source btrfs pool (both read
# only and read-write).
ls -rd ${vol}${pf}_ro* | tail -n +$(( $keep + 1 ))| while read snap
do
    btrfs subvolume delete "$snap" | grep -v 'Transaction commit:'
done
ls -rd ${vol}${pf}_rw* | tail -n +$(( $keep + 1 ))| while read snap
do
    btrfs subvolume delete "$snap" | grep -v 'Transaction commit:'
done

# Same thing for destination (assume the same number of snapshots to keep,
# you can change this if you really want).
$ssh ls -rd $dest_pool/${vol}${pf}_ro* | tail -n +$(( $keep + 1 ))| while read snap
do
    $ssh btrfs subvolume delete "$snap" | grep -v 'Transaction commit:'
done
$ssh ls -rd $dest_pool/${vol}${pf}_rw* | tail -n +$(( $keep + 1 ))| while read snap
do
    $ssh btrfs subvolume delete "$snap" | grep -v 'Transaction commit:'
done

rm $lock

2014/03/23 Btrfs Raid5 Status

π 2014-03-23 01:01 in Btrfs, Linux

How to use Btrfs raid5/6

Since I didn't find good documentation of where Btrfs raid5/raid6 was at, I did some tests, and with some help from list members, can write this page now.

This is as of kernel 3.14 with btrfs-tools 3.12. If you are using a kernel, and especially tools, older than that, chances are good that things will work less well.

Btrfs raid5/6 in a nutshell

It is important to know that raid5/raid6 is more experimental than btrfs itself is. Do not use this for production systems, or if you do and things break, you were warned :)

If you're coming from the mdadm raid5 world, here's what you need to know:

Btrfs is still experimental, but raid5/6 is experimental within btrfs (in other words quite unfinished).

As of 3.14, it works if everything goes right, but the error handling is still lacking. Unexpected conditions are likely to cause unexpected failures. Buyer beware :)

scrub cannot fix issues with raid5/6 yet. This means that if you have any checksum problem, your filesystem will be in a bad state.

btrfs does not yet seem to know that if you removed a drive from an array and you plug it back in later, that drive is out of date. It will auto-add an out of date drive back to an array and that will likely cause data loss by hiding files you had but the old drive didn't have. This means you should wipe a drive cleanly before you put it back into an array it used to be part of. See https://bugzilla.kernel.org/show_bug.cgi?id=72811

btrfs does not deal well with a drive that is present but not working. It does not know how to kick it from the array, nor can it be removed (btrfs device delete) because this causes reading from the drive that isn't working. This means btrfs will try to write to the bad drive forever. The solution there is to umount the array, remount it with the bad drive missing (it cannot be seen by btrfs, or it'll get automounted/added), and then rebuild on a new drive or rebuild/shrink the array to be one drive smaller (this is explained below).

You can add and remove drives from an array and rebalance to grow/shrink an array without umounting it. Note that this is slow since it forces rewriting of all data blocks, and this takes about 3H per 100GB (or 30H per terabyte) with 10 drives on a dual core duo.

If you are missing a drive, btrfs will refuse to mount the array and give an obscure error unless you mount with -o degraded

btrfs has no special rebuild procedure. Rebuilding is done by rebalancing the array. You could actually rebalance a degraded array to a smaller array by rebuilding/balancing without adding a drive, or you can add a drive, rebalance on it, and that will force a read/rewrite of all data blocks, which will restripe them nicely.

btrfs replace does not work, but you can easily do btrfs device add, and btrfs device delete of the other drive, and this will do the same thing.

btrfs device add will not cause an auto rebalance. You could choose not to rebalance existing data and only have new data be balanced properly.

btrfs device delete will force all data from the deleted drive to be rebalanced and the command completes when the drive has been freed up.

The magic command to delete an unused drive from an array while it is missing from the system is btrfs device delete missing .

btrfs doesn't easily tell you that your array is in degraded mode (run btrfs fi show, and it'll show a missing drive as well as how much of your total data is still on it). This does mean you can have an array that is half degraded: half the files are striped over the current drives because they were written after the drive was removed, or were written by a rebalance that hasn't finished, while the other half of your data could be in degraded mode.

You can see this by looking at the amount of data on each drive, anything on drive 11 is properly striped 10 way, while anything on drive 3 is in degraded mode:

polgara:~# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 11 FS bytes used 564.54GiB
        devid    1 size 465.76GiB used 63.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 63.14GiB path /dev/dm-1
        devid    3 size 465.75GiB used 30.00GiB path   <- this device is missing
        devid    4 size 465.76GiB used 63.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 63.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 63.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 63.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 63.14GiB path /dev/mapper/crypt_sdj1
        devid    9 size 465.76GiB used 63.14GiB path /dev/dm-7
        devid    10 size 465.76GiB used 63.14GiB path /dev/dm-8
        devid    11 size 465.76GiB used 33.14GiB path /dev/mapper/crypt_sde1 <- this device was added
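
Since btrfs does not announce degraded mode, a small check on this output can do it for you. The sketch below lists devids printed without a path, which is how the missing device appears above with btrfs-tools 3.12 (the "<-" notes are my annotations, not tool output). Newer tools may format this differently, so treat the parsing as an assumption; missing_devids is an illustrative name.

```shell
#!/bin/sh
# Sketch: list devids that "btrfs fi show" prints without a path, which is
# how a missing device appears in the output above (btrfs-tools 3.12).
# Newer tools may format this differently; the parsing is an assumption.
missing_devids() {
    awk '$1 == "devid" && $7 == "path" && $8 == "" { print $2 }'
}
```

Something like btrfs fi show | missing_devids could run from cron; any output means at least one array has a missing device.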

Create a raid5 array

polgara:/dev/disk/by-id# mkfs.btrfs -f -d raid5 -m raid5 -L backupcopy /dev/mapper/crypt_sd[bdfghijkl]1

WARNING! - Btrfs v3.12 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Turning ON incompat feature 'raid56': raid56 extended format
adding device /dev/mapper/crypt_sdd1 id 2
adding device /dev/mapper/crypt_sdf1 id 3
adding device /dev/mapper/crypt_sdg1 id 4
adding device /dev/mapper/crypt_sdh1 id 5
adding device /dev/mapper/crypt_sdi1 id 6
adding device /dev/mapper/crypt_sdj1 id 7
adding device /dev/mapper/crypt_sdk1 id 8
adding device /dev/mapper/crypt_sdl1 id 9
fs created label backupcopy on /dev/mapper/crypt_sdb1
        nodesize 16384 leafsize 16384 sectorsize 4096 size 4.09TiB
polgara:/dev/disk/by-id# mount -L backupcopy /mnt/btrfs_backupcopy
polgara:/mnt/btrfs_backupcopy# df -h .
Filesystem              Size  Used Avail Use% Mounted on
/dev/mapper/crypt_sdb1  4.1T  3.0M  4.1T   1% /mnt/btrfs_backupcopy

As another example, you could use -d raid5 -m raid1 to have metadata be raid1 while data is raid5. This specific example isn't actually that useful; I'm just giving it as an example.

Replacing a drive that hasn't failed yet on a running raid5 array

btrfs replace does not work:

polgara:/mnt/btrfs_backupcopy# btrfs replace start -r /dev/mapper/crypt_sem1 /dev/mapper/crypt_sdm1  .
Mar 23 14:56:06 polgara kernel: [53501.511493] BTRFS warning (device dm-9): dev_replace cannot yet handle RAID5/RAID6

No big deal, this can be done in 2 steps:

Add the new drive

polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 .
polgara:/mnt/btrfs_backupcopy# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 11 FS bytes used 114.35GiB
        devid    1 size 465.76GiB used 32.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdd1
        devid    4 size 465.76GiB used 32.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 32.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 32.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 32.14GiB path /dev/dm-6
        devid    9 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdk1
        devid    10 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdl1
        devid    11 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sde1
        devid    12 size 465.75GiB used 0.00 path /dev/mapper/crypt_sdm1

btrfs device delete the drive to remove. This neatly causes a rebalance which will happen to use the new drive you just added

polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 .
Mar 23 11:13:31 polgara kernel: [40145.908207] BTRFS info (device dm-9): relocating block group 945203314688 flags 129
Mar 23 14:51:51 polgara kernel: [53245.955444] BTRFS info (device dm-9): found 5576 extents
Mar 23 14:51:57 polgara kernel: [53251.874925] BTRFS info (device dm-9): found 5576 extents
polgara:/mnt/btrfs_backupcopy#

Note that this is slow, 3.5h for just 115GB of data. It could take days for a terabyte array.

polgara:/mnt/btrfs_backupcopy# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 10 FS bytes used 114.35GiB
        devid    1 size 465.76GiB used 13.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 13.14GiB path /dev/mapper/crypt_sdd1
        devid    4 size 465.76GiB used 13.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 13.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 13.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 13.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 13.14GiB path /dev/dm-6
        devid    9 size 465.76GiB used 13.14GiB path /dev/mapper/crypt_sdk1
        devid    10 size 465.76GiB used 13.14GiB path /dev/mapper/crypt_sdl1
        devid    12 size 465.75GiB used 13.14GiB path /dev/mapper/crypt_sdm1

There we go, I'm back on 10 devices. This is almost as good as a btrfs replace; it simply took 2 steps.
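
The two steps can be wrapped in one function. This is a sketch under the same kernel 3.14 assumptions as above; replace_drive and run are illustrative names, and with DRY_RUN set the commands are only printed.

```shell
#!/bin/sh
# Sketch: the two-step replacement above as one function. With DRY_RUN set,
# the commands are only printed. replace_drive and run are illustrative names.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi
}

replace_drive() {
    # $1 = new device, $2 = device to remove, $3 = mountpoint
    run btrfs device add -f "$1" "$3" || return 1
    # Deleting the old device relocates its data, which fills the new one.
    run btrfs device delete "$2" "$3"
}
```

For the example above that would be replace_drive /dev/mapper/crypt_sdm1 /dev/mapper/crypt_sde1 /mnt/btrfs_backupcopy. Try it with DRY_RUN=1 first, and expect the delete step to take hours.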

Replacing a missing drive on a running raid5 array

Normal mount will not work:

polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime LABEL=backupcopy /mnt/btrfs_backupcopy
mount: wrong fs type, bad option, bad superblock on /dev/mapper/crypt_sdj1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so
Mar 21 22:29:45 polgara kernel: [ 2288.285068] BTRFS info (device dm-8): disk space caching is enabled
Mar 21 22:29:45 polgara kernel: [ 2288.285369] BTRFS: failed to read the system array on dm-8
Mar 21 22:29:45 polgara kernel: [ 2288.316067] BTRFS: open_ctree failed

So we do a mount with -o degraded:

polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime,degraded LABEL=backupcopy /mnt/btrfs_backupcopy
/dev/mapper/crypt_sdj1 on /mnt/btrfs_backupcopy type btrfs (rw,noatime,compress=zlib,space_cache,degraded)
Mar 21 22:29:51 polgara kernel: [ 2295.042421] BTRFS: device label backupcopy devid 8 transid 3446 /dev/mapper/crypt_sdj1
Mar 21 22:29:51 polgara kernel: [ 2295.065951] BTRFS info (device dm-8): allowing degraded mounts
Mar 21 22:29:51 polgara kernel: [ 2295.065955] BTRFS info (device dm-8): disk space caching is enabled
Mar 21 22:30:32 polgara kernel: [ 2336.189000] BTRFS: device label backupcopy devid 3 transid 8 /dev/dm-9
Mar 21 22:30:32 polgara kernel: [ 2336.203175] BTRFS: device label backupcopy devid 3 transid 8 /dev/dm-9

Then we add the new drive:

polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sde1 .
polgara:/mnt/btrfs_backupcopy# df .
/dev/dm-0       5.1T  565G  4.0T  13% /mnt/btrfs_backupcopy   < bad, it should be 4.5T, but I get space for 11 drives

https://btrfs.wiki.kernel.org/index.php/FAQ#What_does_.22balance.22_do.3F says:
"On a filesystem with damaged replication (e.g. a RAID-1 FS with a dead and removed disk), it will force the FS to rebuild the missing copy of the data on one of the currently active devices, restoring the RAID-1 capability of the filesystem."

If we have written data since the drive was removed, or if we are recovering from an unfinished balance, filtering on devid=3 tells balance to rewrite only the data and metadata that have a chunk on missing device #3. This is a good way to finish the balance in multiple passes if you have to reboot in between, or if the filesystem deadlocks during a balance, which unfortunately is still common as of kernel 3.14.


polgara:/mnt/btrfs_backupcopy# btrfs balance start -ddevid=3 -mdevid=3 -v .
Mar 22 13:15:55 polgara kernel: [20275.690827] BTRFS info (device dm-9): relocating block group 941277446144 flags 130
Mar 22 13:15:56 polgara kernel: [20276.604760] BTRFS info (device dm-9): relocating block group 940069486592 flags 132
Mar 22 13:19:27 polgara kernel: [20487.196844] BTRFS info (device dm-9): found 52417 extents
Mar 22 13:19:28 polgara kernel: [20488.056749] BTRFS info (device dm-9): relocating block group 938861527040 flags 132
Mar 22 13:22:41 polgara kernel: [20681.588762] BTRFS info (device dm-9): found 70146 extents
Mar 22 13:22:42 polgara kernel: [20682.380957] BTRFS info (device dm-9): relocating block group 937653567488 flags 132
Mar 22 13:26:12 polgara kernel: [20892.816204] BTRFS info (device dm-9): found 71497 extents
Mar 22 13:26:14 polgara kernel: [20894.819258] BTRFS info (device dm-9): relocating block group 927989891072 flags 129

As balancing happens, data is taken out of devid3, the one missing, and added to devid11 (the one added):

polgara:~# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 11 FS bytes used 564.54GiB
        devid    1 size 465.76GiB used 63.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 63.14GiB path /dev/dm-1
        devid    3 size 465.75GiB used 30.00GiB path   <- this device is missing
        devid    4 size 465.76GiB used 63.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 63.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 63.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 63.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 63.14GiB path /dev/mapper/crypt_sdj1
        devid    9 size 465.76GiB used 63.14GiB path /dev/dm-7
        devid    10 size 465.76GiB used 63.14GiB path /dev/dm-8
        devid    11 size 465.76GiB used 33.14GiB path /dev/mapper/crypt_sde1 <- this device was added

You can see status with:

polgara:/mnt/btrfs_backupcopy# while :
> do
> btrfs balance status .
> sleep 60
> done
1 out of about 72 chunks balanced (2 considered),  99% left
2 out of about 72 chunks balanced (3 considered),  97% left
3 out of about 72 chunks balanced (4 considered),  96% left
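
If you would rather log a single number than scroll through the status lines, the percentage can be pulled out of that output. This is a minimal sketch that assumes the status line keeps the "N% left" wording shown above; balance_pct_left is a name made up for this example.

```shell
# Hypothetical helper: read `btrfs balance status` output on stdin and
# print the "N% left" figure as a bare number.
balance_pct_left() {
    sed -n 's/.* \([0-9][0-9]*\)% left.*/\1/p'
}

# Usage (not run here):
#   btrfs balance status /mnt/btrfs_backupcopy | balance_pct_left
```

Lines without a "% left" figure produce no output, so the helper prints nothing once the balance has finished.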

At the end (and this can take hours to days), you get:

polgara:/mnt/btrfs_backupcopy# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 11 FS bytes used 114.35GiB
        devid    1 size 465.76GiB used 32.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdd1
        devid    3 size 465.75GiB used 0.00 path  <----  drive is freed up now.
        devid    4 size 465.76GiB used 32.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 32.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 32.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 32.14GiB path /dev/dm-6
        devid    9 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdk1
        devid    10 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdl1
        devid    11 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sde1
Btrfs v3.12

But the array still shows 11 drives with one missing, and will not mount without -o degraded.
To fix this, remove the missing device with:

polgara:/mnt/btrfs_backupcopy# btrfs device delete missing .
polgara:/mnt/btrfs_backupcopy# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 10 FS bytes used 114.35GiB
        devid    1 size 465.76GiB used 32.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdd1
        devid    4 size 465.76GiB used 32.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 32.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 32.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 32.14GiB path /dev/dm-6
        devid    9 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdk1
        devid    10 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdl1
        devid    11 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sde1

And there we go, we're back in business!

From the above, you've also learned how to grow a raid5 array (add a drive, run balance), or remove a drive (just run btrfs device delete and the auto balance will restripe your entire array for n-1 drives).
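
When scripting the steps above, it can help to detect a missing device from btrfs fi show output before deciding whether to mount with -o degraded. The sketch below assumes the "devid N size X used Y path P" layout shown in this post, where a missing device has nothing after the word "path"; other btrfs-progs versions may print this differently, and missing_devids is a hypothetical name.

```shell
# Hypothetical helper: read `btrfs fi show` output on stdin and print the
# devid of every device line that has nothing after the word "path".
missing_devids() {
    awk '$1 == "devid" && $7 == "path" && $8 == "" { print $2 }'
}

# Usage (not run here):
#   btrfs fi show | missing_devids
```

An empty result means every listed device has a path.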

2014/04/26 Btrfs Tips: Cancel A Btrfs Scrub That Is Already Stopped

π 2014-04-26 01:01 in Btrfs, Linux

How to cancel a btrfs scrub that won't cancel

In some cases, btrfs scrub can be interrupted in a way that leaves it in a half state. State is stored in /var/lib/btrfs/scrub.status.UUID and if the relevant file indicates that scrub is still running (even though it is not), a new scrub cannot be started, nor the already stopped one cancelled.

This is fixed as shown below:

Problem:

gargamel:~# btrfs scrub start -d /dev/mapper/dshelf1
ERROR: scrub is already running.
To cancel use 'btrfs scrub cancel /dev/mapper/dshelf1'.
gargamel:~# btrfs scrub status  /dev/mapper/dshelf1
scrub status for 6358304a-2234-4243-b02d-4944c9af47d7
        scrub started at Tue Apr  8 08:36:18 2014, running for 46347 seconds
        total bytes scrubbed: 5.70TiB with 0 errors
gargamel:~# btrfs scrub cancel  /dev/mapper/dshelf1
ERROR: scrub cancel failed on /dev/mapper/dshelf1: not running

Fix:

gargamel:~# perl -pi -e 's/finished:0/finished:1/' /var/lib/btrfs/*

Verification:

gargamel:~# btrfs scrub status  /dev/mapper/dshelf1
scrub status for 6358304a-2234-4243-b02d-4944c9af47d7
        scrub started at Tue Apr  8 08:36:18 2014 and finished after 46347 seconds
        total bytes scrubbed: 5.70TiB with 0 errors
gargamel:~# btrfs scrub start -d /dev/mapper/dshelf1
scrub started on /dev/mapper/dshelf1, fsid 6358304a-2234-4243-b02d-4944c9af47d7 (pid=24196)
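
The perl one-liner above rewrites every file under /var/lib/btrfs. A narrower variant is sketched below: it touches a single status file and keeps a backup copy. It assumes, as the one-liner does, that the stale state is the literal text finished:0 in that file; mark_scrub_finished is a hypothetical name.

```shell
# Hypothetical helper: flip finished:0 to finished:1 in ONE scrub status
# file, keeping a .bak copy next to it. Pass the full path of the file.
mark_scrub_finished() {
    status_file=$1
    [ -f "$status_file" ] || { echo "no such file: $status_file" >&2; return 1; }
    cp -p "$status_file" "$status_file.bak" || return 1
    sed 's/finished:0/finished:1/' "$status_file.bak" > "$status_file"
}

# Usage (not run here), with the UUID printed by `btrfs scrub status`:
#   mark_scrub_finished /var/lib/btrfs/scrub.status.6358304a-2234-4243-b02d-4944c9af47d7
```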

2014/04/27 Btrfs Multi Device Dmcrypt

π 2014-04-27 01:01 in Btrfs, Linux

How to manage a btrfs filesystem made out of multiple dmcrypt'ed drives

If you are using raid0, raid1, raid10, raid5, or raid6 with btrfs and you want your filesystem to be encrypted, you need to encrypt each device separately, and later you'll want a script to decrypt all those devices.
This can be done with /etc/crypttab, but I don't personally use it for arrays that I turn off to save power. You can use keyscript= in there to feed a script that will provide the decryption key, but I wrote my own script to turn the disks on, locate them by disk ID, decrypt them, and mount the resulting partition.

If you are planning on using Raid5 or Raid6, you'll also want to read this page.

For the mount to work, you of course have to create the crypted device and filesystem first. Here is a recommended way:

cryptsetup luksFormat -s 256 -c aes-xts-plain64 /dev/sda4
cryptsetup luksFormat -s 256 -c aes-xts-plain64 /dev/sdb4
cryptsetup luksFormat -s 256 -c aes-xts-plain64 /dev/sdc4

cryptsetup luksOpen /dev/sda4 sda4_crypt
cryptsetup luksOpen /dev/sdb4 sdb4_crypt
cryptsetup luksOpen /dev/sdc4 sdc4_crypt

mkfs.btrfs -d raid0 -m raid0 -L btrfs_pool /dev/mapper/sd[abc]4_crypt

After reboot, the idea is to avoid the luksOpen steps and adapt to whatever device names those drives could come up under, and this is what the script below does.

Here is the script, start-btrfs-dmcrypt, for which I'll paste a most likely outdated copy here:

#!/bin/bash

# Example script to decrypt a bunch of drives and then mount them as
# part of a btrfs volume.
# 
# By Marc MERLIN <marc_soft@merlins.org> / 2014/04/29
# License: Apache-2.0

# Get these from /dev/disk/by-id
DRIVES="
scsi-SATA_Hitachi_HDS7230_MN5220F323S79K-part1
scsi-SATA_Hitachi_HDS7230_MN5220F325UZMK-part1
scsi-SATA_ST2000DL003-9VT_5YD6MH88-part1
scsi-SATA_ST2000DL003-9VT_5YD70NHX-part1
scsi-SATA_WDC_WD20EARS-00_WD-WMAZA0374092-part1
"

# The label name of your btrfs filesystem (mkfs.btrfs -L btrfs_pool)
LABEL=btrfs_pool

NUMDRIVES=$(echo $DRIVES | wc -w)

die () {
    echo "$1"
    exit 1
}

pwd="$(yourscript that returns crypt key)"
if [[ -z "$pwd" ]]; then
    echo -n "Decryption key? "
    stty -echo 2>/dev/null
    read -r pwd
    stty echo 2>/dev/null
fi
[[ -z "$pwd" ]] && die "Didn't get a decryption key"

# Here you can run a command to turn the disks on if they are on an
# external power outlet.
# turn-disks-on-cmd
cd /dev/disk/by-id
for i in 1 2 3 4 5 6 7 8 9
do
    if [[ $(ls $DRIVES 2>/dev/null | wc -l) -eq $NUMDRIVES ]]; then
	break
    fi
    sleep 10
done
# This is useful if the disks were just turned on.
/etc/init.d/smartmontools restart

for i in $DRIVES
do 
    dev=$(ls -l $i | awk '{print $11}' | sed "s#../..#/dev#")
    [[ -z "$dev" ]] && die "Couldn't find device for $i"
    echo "$pwd" | cryptsetup luksOpen "$dev" "crypt_$(basename $dev)" || die "Couldn't decrypt $dev"
    echo "decrypt $dev"
done
btrfs device scan
mkdir -p /mnt/btrfs_pool
mount -v -t btrfs -o compress=zlib,noatime LABEL=$LABEL /mnt/btrfs_pool || die "Couldn't find btrfs $LABEL"
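
The script finds the real device by parsing ls -l output, which depends on the column layout of ls. An alternative sketch, assuming readlink -f is available (it is in GNU coreutils), resolves the by-id symlink directly; resolve_dev is a hypothetical name and is not part of the script above.

```shell
# Hypothetical helper: resolve a /dev/disk/by-id entry to the device node
# it points at, without parsing `ls -l` output.
resolve_dev() {
    [ -e "$1" ] || return 1       # missing entry or dangling symlink
    readlink -f "$1"
}

# Usage inside the loop above (not run here):
#   dev=$(resolve_dev "/dev/disk/by-id/$i") || die "Couldn't find device for $i"
```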

2014/05/04 Fixing Btrfs Filesystem Full Problems

π 2014-05-04 01:01 in Btrfs, Linux

Fixing Btrfs Filesystem Full Problems

Clear space now

If you have historical snapshots, the quickest way to get space back so that you can look at the filesystem and apply better fixes and cleanups is to drop the oldest historical snapshots.

Two things to note:

If you have historical snapshots as described here, delete the oldest ones first, and wait (see below). However, if you just deleted 100GB and replaced it with another 100GB that failed to fully write, giving you an out of space error, all your snapshots will have to be deleted to free the blocks of the old file you removed to make room for the new one. (If you know exactly which file it is, you can go into each of your snapshots and delete it manually, but in the common case it'll be multiple files and you won't know which ones, so you'll have to drop all your snapshots before you get the space back.)

After deleting snapshots, it can take a minute or more for btrfs fi show to show the space freed. Do not be too impatient, run btrfs fi show in a loop and see if the number changes every minute. If it does not, carry on and delete other snapshots or look at rebalancing.
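
One way to watch that number is sketched below. It assumes the btrfs fi show layout used elsewhere in this post, with an unquoted label on the "Label:" line (some btrfs-progs versions print the label in quotes, which this sketch does not handle); fs_bytes_used is a hypothetical name.

```shell
# Hypothetical helper: read `btrfs fi show` output on stdin and print the
# "FS bytes used" value for the filesystem with the given label.
fs_bytes_used() {
    awk -v label="$1" '
        $1 == "Label:" { current = $2 }
        current == label && /FS bytes used/ { print $NF }
    '
}

# Usage (not run here): print the figure once a minute until interrupted.
#   while :; do btrfs fi show | fs_bytes_used btrfs_pool1; sleep 60; done
```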

Note that even in the cases described below, you may have to clear one snapshot or more to make space before btrfs balance can run. As a corollary, btrfs can get into states where it's hard to get it out of the 'no space' state it's in. As a result, even if you don't need snapshots, keeping at least one around that you can delete to free up space, should you hit that mis-feature/bug, can be handy.

Is your filesystem really full? Mis-balanced metadata and/or data chunks

Below, you'll see how to rebalance data blocks and metadata. If you are unlucky enough to get a filesystem full error before you balance, try running this first:

legolas:~# btrfs balance start -musage=0 /mnt/btrfs_pool1 &
legolas:~# btrfs balance start -dusage=0 /mnt/btrfs_pool1 &

A null rebalance will help in some cases; if not, read on.

Also, if you are really unlucky, you might get in a no more space error that requires adding a temporary block device to your filesystem to allow balance to run. See below for details.

Pre-emptively rebalancing your filesystem

In an ideal world, btrfs would do this for you, but it does not.
I personally recommend you do a rebalance weekly or nightly as part of a btrfs scrub cron job. See the btrfs-scrub script.

Is your filesystem really full? Mis-balanced data chunks

Look at filesystem show output:

legolas:~# btrfs fi show
Label: btrfs_pool1  uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
	Total devices 1 FS bytes used 441.69GiB
	devid    1 size 865.01GiB used 751.04GiB path /dev/mapper/cryptroot

Only about 50% of the space is used (441 out of 865GB), but the device is 87% full (751 out of 865GB). Unfortunately it's not uncommon for a btrfs device to fill up due to the fact that it does not rebalance chunks (3.18+ has started freeing empty chunks, which is a step in the right direction).
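
The two percentages can be recomputed from the btrfs fi show output itself. This is a minimal sketch for a single-device filesystem, assuming all three figures are printed in the same unit (GiB in the output above); alloc_report is a hypothetical name.

```shell
# Hypothetical helper: read `btrfs fi show` output for a single-device
# filesystem on stdin and print two percentages of the device size:
# data actually stored, and space allocated to chunks.
# Assumes all three figures use the same unit.
alloc_report() {
    awk '
        /FS bytes used/ { stored = $NF + 0 }
        $1 == "devid"   { size = $4 + 0; allocated = $6 + 0 }
        END {
            if (size > 0)
                printf "stored %.0f%% allocated %.0f%%\n",
                       100 * stored / size, 100 * allocated / size
        }
    '
}

# Usage (not run here):
#   btrfs fi show /mnt/btrfs_pool1 | alloc_report
```

A large gap between the two figures is the sign that a balance is likely to give space back.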

In the case above, because the filesystem is only about half full, I can ask balance to rewrite all chunks that have less than 55% of their space used. Rebalancing those chunks means taking the data they hold and packing it into fuller chunks, so that the emptier ones can be freed.
This means the bigger the -dusage value, the more work balance will have to do (i.e. taking fuller and fuller chunks and trying to free them up by putting their data elsewhere). Also, if your FS is 55% full, using -dusage=55 is ok, but there isn't a 1 to 1 correlation and you'll likely be ok with a smaller dusage number, so start small and ramp up as needed.


legolas:~# btrfs balance start -dusage=55 /mnt/btrfs_pool1 &

# Follow the progress along with:
legolas:~# while :; do btrfs balance status -v /mnt/btrfs_pool1; sleep 60; done
Balance on '/mnt/btrfs_pool1' is running
10 out of about 315 chunks balanced (22 considered),  97% left
Dumping filters: flags 0x1, state 0x1, force is off
  DATA (flags 0x2): balancing, usage=55
Balance on '/mnt/btrfs_pool1' is running
16 out of about 315 chunks balanced (28 considered),  95% left
Dumping filters: flags 0x1, state 0x1, force is off
  DATA (flags 0x2): balancing, usage=55
(...)

When it's over, the filesystem now looks like this (note devid used is now 513GB instead of 751GB):

legolas:~# btrfs fi show
Label: btrfs_pool1  uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
	Total devices 1 FS bytes used 441.64GiB
	devid    1 size 865.01GiB used 513.04GiB path /dev/mapper/cryptroot

Before you ask, yes, btrfs should do this for you on its own, but currently doesn't as of 3.14.
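
The "start small and ramp up" advice can be sketched as a loop over increasing -dusage values. The step values below are arbitrary choices for illustration, and ramp_balance and the BTRFS override variable are hypothetical names. Without a trailing &, btrfs balance start runs in the foreground, so each pass finishes before the next one starts.

```shell
# Hypothetical helper: run data balances with increasing -dusage values,
# stopping at the first failure. Set BTRFS=echo to preview the commands
# without touching the filesystem.
ramp_balance() {
    mnt=$1
    max=${2:-55}
    for usage in 5 10 20 30 40 55; do
        [ "$usage" -le "$max" ] || break
        "${BTRFS:-btrfs}" balance start -dusage="$usage" "$mnt" || return 1
    done
}

# Usage (not run here): stop after the -dusage=20 pass.
#   ramp_balance /mnt/btrfs_pool1 20
```

Checking btrfs fi show between passes tells you whether a higher value is still worth the extra I/O.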

Is your filesystem really full? Misbalanced metadata

Unfortunately btrfs has another failure case where the metadata space can fill up. When this happens, even though you have data space left, no new files will be writeable.

In the example below, you can see Metadata DUP 9.5GB out of 10GB. Btrfs keeps 0.5GB for itself, so in that case metadata is full and prevents new writes.

One suggested way is to force a full rebalance, and in the example below you can see metadata goes back down to 7.39GB after it's done. Yes, there again, it would be nice if btrfs did this on its own. It will one day (some of it is now in 3.18).

Sometimes, just using -dusage=0 is enough to rebalance metadata (this is now done automatically in 3.18 and above), but if it's not enough, you'll have to increase the number.


legolas:/mnt/btrfs_pool2# btrfs fi df .
Data, single: total=800.42GiB, used=636.91GiB
System, DUP: total=8.00MiB, used=92.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=10.00GiB, used=9.50GiB
Metadata, single: total=8.00MiB, used=0.00

legolas:/mnt/btrfs_pool2# btrfs balance start -v -dusage=0 /mnt/btrfs_pool2
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=0
  Done, had to relocate 91 out of 823 chunks

legolas:/mnt/btrfs_pool2# btrfs fi df .
Data, single: total=709.01GiB, used=603.85GiB
System, DUP: total=8.00MiB, used=88.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=10.00GiB, used=7.39GiB
Metadata, single: total=8.00MiB, used=0.00
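
To spot this situation before writes start failing, the metadata lines of btrfs fi df can be turned into a percentage. The sketch below assumes the "Metadata, PROFILE: total=X, used=Y" layout shown above and treats a value without a unit suffix as MiB, which only matters for the bare 0.00 entries; metadata_usage is a hypothetical name.

```shell
# Hypothetical helper: read `btrfs fi df` output on stdin and print, for
# each Metadata line, the profile and the percentage of its total in use.
metadata_usage() {
    awk -F'[ ,:=]+' '
    function to_mib(v,   n) {
        n = v + 0                      # numeric prefix, e.g. 9.5 from "9.50GiB"
        if (v ~ /TiB$/) return n * 1024 * 1024
        if (v ~ /GiB$/) return n * 1024
        if (v ~ /KiB$/) return n / 1024
        return n                       # MiB, or a bare 0.00
    }
    $1 == "Metadata" {
        total = to_mib($4); used = to_mib($6)
        if (total > 0) printf "%s %.0f\n", $2, 100 * used / total
    }'
}

# Usage (not run here):
#   btrfs fi df /mnt/btrfs_pool2 | metadata_usage
```

Given that btrfs keeps about 0.5GB for itself, a DUP figure in the mid-90s on a 10GiB allocation, as in the first listing, is effectively full.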

Balance cannot run because the filesystem is full

If a null rebalance (-musage=0 and then -dusage=0 explained above) doesn't work, one last trick to get around this is to add a device (even a USB key will do) to your btrfs filesystem. This should allow balance to start, and then you can remove the device with btrfs device delete when the balance is finished.

Note, it's even possible for a filesystem to be full enough in a way that you cannot even delete snapshots to free space. This shows how you would work around it:

root@polgara:/mnt/btrfs_pool2# btrfs fi df .
Data, single: total=159.67GiB, used=80.33GiB
System, single: total=4.00MiB, used=24.00KiB
Metadata, single: total=8.01GiB, used=7.51GiB  <<<< BAD

root@polgara:/mnt/btrfs_pool2# btrfs balance start -v -dusage=0 /mnt/btrfs_pool2
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=0
  Done, had to relocate 0 out of 170 chunks

root@polgara:/mnt/btrfs_pool2# btrfs balance start -v -dusage=1 /mnt/btrfs_pool2
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=1
ERROR: error during balancing '/mnt/btrfs_pool2' - No space left on device
There may be more info in syslog - try dmesg | tail

root@polgara:/mnt/btrfs_pool2# dd if=/dev/zero of=/var/tmp/btrfs bs=1G count=5
5+0 records in
5+0 records out
5368709120 bytes (5.4 GB) copied, 7.68099 s, 699 MB/s

root@polgara:/mnt/btrfs_pool2# losetup -v -f /var/tmp/btrfs
Loop device is /dev/loop0

root@polgara:/mnt/btrfs_pool2# btrfs device add /dev/loop0 .
Performing full device TRIM (5.00GiB) ...

# optional step if you have snapshots to delete, if not try the balance below
root@polgara:/mnt/btrfs_pool2# btrfs subvolume delete space2_daily_20140603_00:05:01
Delete subvolume '/mnt/btrfs_pool2/space2_daily_20140603_00:05:01'
root@polgara:/mnt/btrfs_pool2# for i in *daily*; do btrfs subvolume delete $i; done
Delete subvolume '/mnt/btrfs_pool2/space2_daily_20140604_00:05:01'
Delete subvolume '/mnt/btrfs_pool2/space2_daily_20140605_00:05:01'
Delete subvolume '/mnt/btrfs_pool2/space2_daily_20140606_00:05:01'
Delete subvolume '/mnt/btrfs_pool2/space2_daily_20140607_00:05:01'
Delete subvolume '/mnt/btrfs_pool2/space2_daily_20140608_00:05:01'
Delete subvolume '/mnt/btrfs_pool2/space2_daily_20140609_00:05:01'

root@polgara:/mnt/btrfs_pool2# btrfs balance start -v -dusage=1 /mnt/btrfs_pool2
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=1
  Done, had to relocate 5 out of 169 chunks

root@polgara:/mnt/btrfs_pool2# btrfs device delete /dev/loop0 .

root@polgara:/mnt/btrfs_pool2# btrfs fi df .
Data, single: total=154.01GiB, used=80.06GiB
System, single: total=4.00MiB, used=28.00KiB
Metadata, single: total=8.01GiB, used=4.88GiB  <<< GOOD
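
The session above stops at btrfs device delete and leaves the loop device and its backing file in place. The sketch below prints the whole sequence, including those two cleanup steps, so it can be reviewed before anything is run. It assumes the image file lives on a different filesystem from the full one, and it attaches an explicitly named loop device instead of using losetup -f; loop_balance_plan is a hypothetical name.

```shell
# Hypothetical helper: print (do not run) the temporary-loop-device
# sequence for a mountpoint, an image file and a loop device.
loop_balance_plan() {
    mnt=$1; img=$2; loopdev=$3
    cat <<EOF
dd if=/dev/zero of=$img bs=1G count=5
losetup $loopdev $img
btrfs device add $loopdev $mnt
btrfs balance start -v -dusage=1 $mnt
btrfs device delete $loopdev $mnt
losetup -d $loopdev
rm $img
EOF
}

# Usage (not run here):
#   loop_balance_plan /mnt/btrfs_pool2 /var/tmp/btrfs /dev/loop0
```

Run the printed commands one at a time: the device delete step has to succeed before the loop device is detached, otherwise the filesystem is left with a missing device.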

Misc Balance Resources

For more info, please read:

https://btrfs.wiki.kernel.org/index.php/FAQ#Raw_disk_usage

https://btrfs.wiki.kernel.org/index.php/Balance_Filters

2014/05/19 Btrfs-diff Between Snapshots

π 2014-05-19 01:01 in Btrfs, Linux

Differences between two btrfs snapshots

When you have historical snapshots, it may be useful to know what changed between 2 snapshots.

The best way to do this long term is to modify "btrfs send" to compute changes between the snapshots and just output the filelist instead of a stream with data.

However, until then, there is a hack that shows you files that got added or modified between two snapshots. It's not bulletproof like btrfs send, but it can give you a quick, mostly working diff between two snapshots (*it will not show renames or deletes*). See more caveats on this original serverfault post.

legolas:/mnt/btrfs_pool1# btrfs-diff usr_ro.20140513_05:00:01/ usr_ro.20140514_06:00:02/
share/doc/linux-image-3.15.0-rc5-amd64-i915-preempt-20140216s1/buildinfo.gz
share/doc/linux-image-3.15.0-rc5-amd64-i915-preempt-20140216s1/Buildinfo.gz
share/doc/linux-image-3.15.0-rc5-amd64-i915-preempt-20140216s1/changelog.Debian.gz
share/doc/linux-image-3.15.0-rc5-amd64-i915-preempt-20140216s1/Changes.gz
(...)

You can download my latest snapshot of btrfs-diff. Note that I am not the author, it was copied from this serverfault post.

#!/bin/bash

# Author: http://serverfault.com/users/96883/artfulrobot
# License: Unknown
#
# This script will show most files that got modified or added.
# Renames and deletions will not be shown.
# Read limitations on:
# http://serverfault.com/questions/399894/does-btrfs-have-an-efficient-way-to-compare-snapshots
# 
# btrfs send is the best way to do this long term, but as of kernel
# 3.14, btrfs send cannot just send a list of changed files without
# scanning and sending all the changed data blocks along.

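usage() { echo "$@" >&2; echo "Usage: $0 <older-snapshot> <newer-snapshot>" >&2; exit 1; }

[ $# -eq 2 ] || usage "Incorrect invocation"
SNAPSHOT_OLD=$1
SNAPSHOT_NEW=$2

[ -d "$SNAPSHOT_OLD" ] || usage "$SNAPSHOT_OLD does not exist"
[ -d "$SNAPSHOT_NEW" ] || usage "$SNAPSHOT_NEW does not exist"

# Asking for changes newer than a huge generation number returns no files,
# only the trailing "transid marker was N" line, which gives the generation
# of the older snapshot.
OLD_TRANSID=$(btrfs subvolume find-new "$SNAPSHOT_OLD" 9999999)
OLD_TRANSID=${OLD_TRANSID#transid marker was }
[ -n "$OLD_TRANSID" ] && [ "$OLD_TRANSID" -gt 0 ] || usage "Failed to find generation for $SNAPSHOT_OLD"

# List what changed in the newer snapshot since that generation, drop the
# trailing marker line, and keep only the path (field 17 onwards).
btrfs subvolume find-new "$SNAPSHOT_NEW" "$OLD_TRANSID" | sed '$d' | cut -f17- -d' ' | sort | uniq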
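
To see what the last line of the script does, the same pipeline can be fed saved find-new output. The sample lines in the usage comment are invented for illustration, and their layout (sixteen space-separated fields before the path, then a final "transid marker" line) is an assumption based on the field number the script cuts on; paths_from_find_new is a hypothetical name.

```shell
# Same pipeline as the last line of btrfs-diff, but reading saved
# `btrfs subvolume find-new` output from stdin instead of running btrfs.
paths_from_find_new() {
    sed '$d' | cut -f17- -d' ' | sort | uniq
}

# Usage (not run here), with invented sample lines:
#   printf '%s\n' \
#     'inode 257 file offset 0 len 4096 disk start 1000 offset 0 gen 100 flags NONE b.txt' \
#     'inode 258 file offset 0 len 4096 disk start 2000 offset 0 gen 100 flags NONE a.txt' \
#     'transid marker was 101' | paths_from_find_new
```

Because cut keeps everything from field 17 onwards, paths containing spaces come through intact, and a file with several changed extents is listed only once.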

2014/05/20 Historical Snapshots of Backups With Btrfs

π 2014-05-20 01:01 in Btrfs, Linux

How to manage historical snapshots of backups with Btrfs

I have a setup where I back up a certain number of machines to a central server. There are multiple ways to do historical backups with btrfs.

snapshots and rsync on top

http://marc.merlins.org/linux/talks/Btrfs-LC2014-JP/html/img33.html

not great because of COW relationship lost to unix tools,

not great because backing up server on another one requires a lot more snapshots of snapshots for btrfs send of old things that never change

works for deduping data that partially changes or changes owners

cp -a --link + rsync

http://marc.merlins.org/linux/talks/Btrfs-LC2014-JP/html/img34.html

newer btrfs should be ok with many hardlinks

du can figure out the data saved

you can use hardlinks.py instead of bedup

can be transferred via cp/rsync without losing links

but hardlinks will not work across subvolumes

cp -a --reflink + rsync

http://marc.merlins.org/linux/talks/Btrfs-LC2014-JP/html/img35.html

very nice, does not require hardlinks which you can't do across subvolumes