Sometimes you need to go bleeding edge to actually gain stability...
Originally I had chosen Debian as the distro for my self-made NAS in order to have something rock-solid to build on. The wheezy-backports repo provided me with a slightly newer kernel, but trouble arose quickly. Apart from the fact that most of the software was ancient, the kernel seemed to have some issues with the relatively new Kabini platform:
[1834950.926512] [Hardware Error]: Corrected error, no action required.
[1834950.926578] [Hardware Error]: CPU:1 (16:0:1) MC1_STATUS[Over|CE|-|-|AddrV|-|-]: 0xd400000000000012
[1834950.926614] [Hardware Error]: MC1_ADDR: 0x00007fb027effd60
[1834950.926632] [Hardware Error]: MC1 Error: L2 TLB parity error.
[1834950.926668] [Hardware Error]: cache level: L2, tx: INSN
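The status word in that log can be decoded by hand, which helps to confirm that these really were corrected errors. A minimal sketch in Python, assuming the standard x86 machine-check layout for the top bits of MCi_STATUS; the names FLAGS and decode_mc_status are my own for illustration and do not come from any tool:

```python
# Bit positions of the architectural flags in an x86 MCi_STATUS register.
# Assumption: only these top-level flags are decoded; the low bits
# (the error code, 0x0012 here) are left alone.
FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # uncorrected error (clear means corrected, "CE")
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mc_status(status: int) -> dict:
    """Return a mapping of flag name to True/False for a raw status value."""
    return {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}


if __name__ == "__main__":
    # Value taken from the MC1_STATUS line in the log above.
    flags = decode_mc_status(0xD400000000000012)
    for name, is_set in flags.items():
        print(f"{name:6s} {'set' if is_set else '-'}")
```

For the value above this reports OVER and ADDRV set and UC clear, which matches the Over|CE|...|AddrV fields the kernel printed.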
When I decided to replace the two 1 TiB hard drives with 5 TiB ones, the system refused to complete the rebuild (btrfs replace), throwing kernel dumps like this:
[ 5734.989051] BTRFS: dev_replace from <missing disk> (devid 4) to /dev/mapper/archive4-plain) finished
[ 5734.989169] BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
[ 5734.993146] IP: [<ffffffffa0357b8d>] btrfs_kobj_rm_device+0x1d/0x40 [btrfs]
[ 5734.997154] PGD 8b192067 PUD 8b193067 PMD 0
[ 5735.001065] Oops: 0000 [#1] SMP
[ 5735.004953] Modules linked in: xt_hl ip6t_REJECT cpufreq_userspace nf_conntrack_ipv6 nf_defrag_ipv6 cpufreq_stats cpufreq_conservative ip6table_filter cpufreq_powersave ip6_tables xt_tcpudp ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ext4 crc16 mbcache jbd2 w83627ehf hwmon_vid dm_crypt ppdev amd_freq_sensitivity kvm_amd pcspkr parport_pc radeon psmouse evdev ttm drm_kms_helper kvm drm i2c_piix4 i2c_algo_bit k10temp fam15h_power tpm_tis serio_raw tpm edac_mce_amd edac_core i2c_core parport shpchp button acpi_cpufreq processor thermal_sys btrfs xor raid6_pq dm_mod sg sd_mod crc_t10dif usb_storage crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd ahci libahci ohci_pci ehci_pci ohci_hcd xhci_hcd ehci_hcd libata usbcore r8169 scsi_mod usb_common mii
[ 5735.040557] CPU: 3 PID: 4177 Comm: btrfs Not tainted 3.16-0.bpo.2-amd64 #1 Debian 3.16.3-2~bpo70+1
[ 5735.045316] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AM1B-ITX, BIOS P1.10 02/21/2014
[ 5735.050155] task: ffff8800939a69e0 ti: ffff88008b0f0000 task.ti: ffff88008b0f0000
[ 5735.054981] RIP: 0010:[<ffffffffa0357b8d>] [<ffffffffa0357b8d>] btrfs_kobj_rm_device+0x1d/0x40 [btrfs]
[ 5735.059962] RSP: 0018:ffff88008b0f3c88 EFLAGS: 00010286
[ 5735.064861] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880139206398
[ 5735.069809] RDX: ffff88008c625810 RSI: ffff880139297400 RDI: ffff880093bd14c0
[ 5735.074716] RBP: ffff88008c419dc8 R08: 000000000000000a R09: 0000000000000000
[ 5735.079639] R10: 0000000000000505 R11: 0000000000000504 R12: ffff880093aa6800
[ 5735.084534] R13: ffff88008c419e38 R14: ffff880139297400 R15: ffff88008c625800
[ 5735.089431] FS: 00007fa8b4fd4880(0000) GS:ffff88013ed80000(0000) knlGS:0000000000000000
[ 5735.094344] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5735.099261] CR2: 0000000000000088 CR3: 000000008b191000 CR4: 00000000000407e0
[ 5735.104196] Stack:
[ 5735.109083] ffff88008c419000 ffffffffa03aec5b 000009cd8e110000 ffff88008c419100
[ 5735.114069] 000000008c419bd8 0000000000000000 0000000000000000 000001ffffcff000
[ 5735.119074] ffff88013667c421 ffff8800939a69e0 0000000000000000 ef43bae53579db43
[ 5735.124065] Call Trace:
[ 5735.129047] [<ffffffffa03aec5b>] ? btrfs_dev_replace_finishing+0x37b/0x600 [btrfs]
[ 5735.134092] [<ffffffffa03af6a0>] ? btrfs_dev_replace_start+0x350/0x460 [btrfs]
[ 5735.139136] [<ffffffffa037878d>] ? btrfs_ioctl+0x17ad/0x1ea0 [btrfs]
[ 5735.144114] [<ffffffff811c6294>] ? path_lookupat+0x74/0x770
[ 5735.149077] [<ffffffff8105beda>] ? __do_page_fault+0x29a/0x530
[ 5735.153986] [<ffffffff810f562a>] ? from_kgid_munged+0xa/0x20
[ 5735.158891] [<ffffffff811cc796>] ? do_vfs_ioctl+0x86/0x4e0
[ 5735.163756] [<ffffffff811b804f>] ? filp_close+0x5f/0x90
[ 5735.168603] [<ffffffff811ccc91>] ? SyS_ioctl+0xa1/0xc0
[ 5735.173384] [<ffffffff81548508>] ? page_fault+0x28/0x30
[ 5735.178158] [<ffffffff8154646d>] ? system_call_fast_compare_end+0x10/0x15
[ 5735.182932] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 00 66 66 66 66 90 53 48 8b bf f0 09 00 00 48 85 ff 74 1f 31 db 48 85 f6 74 14 48 8b 46 78 <48> 8b 80 88 00 00 00 48 8b 70 38 e8 33 66 ed e0 89 d8 5b c3 bb
[ 5735.193083] RIP [<ffffffffa0357b8d>] btrfs_kobj_rm_device+0x1d/0x40 [btrfs]
[ 5735.198072] RSP <ffff88008b0f3c88>
[ 5735.202963] CR2: 0000000000000088
[ 5735.225323] ---[ end trace 6a09f8fe40142527 ]---
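When searching bug trackers for a dump like this, the faulting symbol from the final RIP line is the most useful search key. A rough sketch in Python for pulling it out of a saved log, assuming the oops uses the same line format as the one above (newer kernels print this line differently); RIP_RE and faulting_symbol are hypothetical names of my own:

```python
import re

# Matches the summary line near the end of the oops, e.g.
#   RIP [<ffffffffa0357b8d>] btrfs_kobj_rm_device+0x1d/0x40 [btrfs]
# Assumption: the 3.16-era format shown above.
RIP_RE = re.compile(
    r"RIP\s+\[<[0-9a-f]+>\]\s+"                   # address in brackets
    r"(?P<symbol>\w+)\+0x(?P<offset>[0-9a-f]+)"    # function + offset
    r"/0x(?P<size>[0-9a-f]+)"                      # function size
    r"(?:\s+\[(?P<module>\w+)\])?"                 # optional module name
)


def faulting_symbol(log_text: str):
    """Return (symbol, offset, size, module) or None if no RIP line is found."""
    match = RIP_RE.search(log_text)
    if match is None:
        return None
    return (
        match.group("symbol"),
        int(match.group("offset"), 16),
        int(match.group("size"), 16),
        match.group("module"),
    )


if __name__ == "__main__":
    line = ("[ 5735.193083] RIP [<ffffffffa0357b8d>] "
            "btrfs_kobj_rm_device+0x1d/0x40 [btrfs]")
    print(faulting_symbol(line))
```

The function name plus the kernel version (3.16.3 here) is usually enough to narrow a search down to the relevant reports.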
I tracked this down to some kernel bug reports describing lockups when using BTRFS compression, but found no workaround.
Then I decided to use an Arch Linux live stick for the rebuild, which went absolutely smoothly on the first try. The system felt so much more responsive that I decided to permanently replace Debian with Arch.
Conclusion: sometimes you need to use bleeding-edge distros to actually gain stability, especially if you have new hardware.
I know, I could have just baked my own kernel and replaced the admittedly outdated 3.16.3 with something closer to the one used in the Arch live system. But it was just less painful this way.