Marc's Public Blog - Linux Hacking





This page has a few of my blog entries about linux, but my main linux page is here

Here is a list of older linux event reports I made before my blog was started, then the rest are below
1996/11/18-21: Linux Pavilion Comdex Fall 1996 (photos only). I've been going since then to help at the linux pavilion.
1997/11/18-21: Linux Pavilion Comdex Fall 1997 (photos only)
1998/05/28-30: Linuxexpo 1998 (photos only)
1998/11/16-20: Linux Pavilion Comdex Fall 1998 (full report)
1998/11/11: Silicon Valley Tea Party (report with pictures)
1999/02/15: Windows Refund Day (report with pictures)
1999/03/20: SVLUG KTEH night (photos only)
1999/03/01-04: LinuxWorld Expo Winter 99 (complete report with many pictures)
1999/03/31: Mozilla Party one year anniversary (photos only)
1999/05/18-22: Linuxexpo 1999 (complete report with many pictures)
1999/06/07: June 99 Balug meeting with Linus
1999/08/09-12: LinuxWorld Expo Summer 99 (complete report with many pictures)
1999/11/15-19: Linux Business Show at Comdex Fall 1999 (full report with pictures)
2000/08/14-17: LinuxWorld Expo Summer 2000 (complete report with many pictures)
2001/01/17-20: Linux.conf.au/LCA 2001 (complete report with pictures)
2001/07/25-28: OLS 2001 (photos only)
2001/08/25: Linux 10th Anniversary (report with pictures)
2001/09/27-30: LinuxWorld Expo Summer 2001 (report with pictures)
2001/11/05-10: ALS 2001 (photos only)
2002/06/26-29: OLS 2002 (photos only)
2003/01/20-25: LCA 2003 (photos only)
2003/07/23-26: OLS 2003 (photos only)
2004/01/12-17: LCA 2004 (photos only)
2004/07/21-24: OLS 2004 (photos only)
2005/04/18-23: LCA 2005 (photos only)
2006/01/24-28: LCA 2006 (photos only)
2007/01/17-21: LCA 2007 (photos only)

Here is a list of all the talks I've given:

And below are my blog posts:




π 2026-04-13 01:01 in Computers, Linux

Replacing a 16 year old Sandy Bridge Server running 12 Spinning Rust Drives with something more efficient

My old Intel Sandy Bridge server gargamel, built in 2010, initially with a dual core duo and later upgraded to a quad core with hyperthreading, was 16 years old. It was still working, but I had already replaced the drives multiple times, from 2TB to 4TB, 6TB, and eventually 12TB drives, as the previous drives were getting old and started failing (my first ridiculous NAS was 2TB, with 26 SCSI SCA drives in 3 enclosures, circa 2002).

I set up that last server with 10 SATA drives in 2 enclosures of 5 drives each. It's been running for over 15 years with just a few drive upgrades and replacements, and is now at 64TB of spinning rust. Turns out I didn't really need that much, but on the last drive upgrade I went directly from 6TB to 12TB.

The server still works fine, but it's ultimately still running a debian install from 1999 that's been upgraded all these years, including a 32/64bit dual userland without systemd. But fighting "progress" only goes so far, and my 2nd disk array with 10Y+ old 4TB drives was starting to have more drive failures. Also, I realized that 250W+ of power is a bit more than needed, so I decided to upgrade to an rPi5 with 16GB of RAM and see if I could make a decent linux server out of it.

Considering a rPi5 with 20 SSDs

Here is what I did:
  • An rPi5 supports PCI, but I got a bit over ambitious with it. I got a 4X M2 slot switch for the 2 old leftover NVMEs (500GB each) that I boot from in raid0
  • I also bought two 9 port M2 sata cards, which allow for 18 drives.
  • At first I was thinking about having a few SSDs and re-using my 12TB drives in an external enclosure I already have. I also found a USB-3 to 3 sata adapter that I can run the disk enclosure with, using USB3 which is 5Gbit/s, instead of going through a PCI sata card.

    But in the end, I decided to go without any spinning drives at all and went for a bunch of ebay 4TB SSDs to fill up all 18 slots, yielding 56TB. It was never the plan to have that much, but it's a pain to upgrade the arrays later and it felt more efficient to just fill up the raids with more drives. So I now have

    /dev/mapper/dshelf1   30T  5.4T   24T  19% /mnt/btrfs_pool1 => 10x cheap TLC + QLC SSDs in raid6
    /dev/mapper/dshelf2   25T  128G   25T   1% /mnt/btrfs_pool2 => 8x more expensive MLC/TLC enterprise drives in raid5
    /dev/mapper/dshelf3  447G  6.1M  445G   1% /mnt/btrfs_pool3 => left over space from some QLC drives that are 4.09TB
    /dev/mapper/dshelf4  447G  6.1M  445G   1% /mnt/btrfs_pool4 => left over space from MLC drives

    The next problem is "how do you power 18 directly connected external drives?". You're going to tell me to just get drive enclosures, but it turns out there are few, if any, external drive enclosures for 2.5" drives that offer direct sata connections as well as their own power. You would think it shouldn't be too hard to buy a reasonably sized standalone 12V/5V power supply for sata drives that offers more than 20A of 5V (even NVME drives can take more than 1A each), but I didn't find any without buying a full bore ATX power supply and dealing with it not coming on on its own because it's not connected to a motherboard. So I had to make my own: I took a 40A 5V power supply I had laying around for LEDs, joined it with a 12V 7A power supply, and made my own sata power bus.
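As a rough sanity check on that power bus, here is a minimal Python sketch of the 5V budget. The 1.2A per-drive peak and the 25% headroom are assumptions picked for illustration (the text only says drives can take more than 1A each), and `rail_amps_needed` is a hypothetical helper name, not something from a real tool.

```python
# Rough 5V power budget for a shelf of SATA SSDs.
# amps_per_drive and headroom are illustrative assumptions, not measurements.


def rail_amps_needed(num_drives: int, amps_per_drive: float,
                     headroom: float = 1.25) -> float:
    """Return the 5V current to provision, including a safety margin."""
    if num_drives < 0 or amps_per_drive < 0 or headroom < 1.0:
        raise ValueError("counts and currents must be >= 0, headroom >= 1.0")
    return num_drives * amps_per_drive * headroom


def supply_is_enough(supply_amps: float, num_drives: int,
                     amps_per_drive: float) -> bool:
    """True if the supply covers the drives plus the default headroom."""
    return supply_amps >= rail_amps_needed(num_drives, amps_per_drive)


if __name__ == "__main__":
    needed = rail_amps_needed(18, 1.2)  # 18 drives, assumed 1.2A peak each
    print(f"Provision about {needed:.1f}A on the 5V rail")
    print("40A supply is enough:", supply_is_enough(40, 18, 1.2))
    print("20A supply is enough:", supply_is_enough(20, 18, 1.2))
```

With those assumed numbers, a 20A rail falls short while the 40A LED supply leaves room to spare, which matches the reason for wanting more than 20A in the first place.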


    From there, I could indeed have 18 drives hang off the sata power plugs ;)


    Or I could do something a bit better, and I found these nice enclosures. Unfortunately they cost $90 each even though they don't provide their own power, and sadly the built in fans require 12V, so I have to send them dual power just for that, otherwise I'd be able to power the entire thing from 5V:



    Making all this work on an rPi5

    So you're going to tell me that maybe an rPi5 wasn't really meant to have a PCI bus, never mind to run 20 SSDs (18 Sata + 2 M2 NVME), and maybe you'd be right, but I got excited when I got this quad NVME expander board for my Pi5:


    I mean it does look pretty and exciting ;)


    blinkenlights win ;)

    But what I didn't pay enough attention to is that it's still a single lane PCI bus (after all the Pi5 is not exactly a real server board), so what that PCI splitter board does is use a PCI switching chip to create 4 lanes out of 1 by switching PCI packets. This does not create extra bandwidth, it just puts more drives on the same single lane. I got things to work but, unsurprisingly, doing a 10 drive raid6 rebuild was slow, only 50MB/s, which is slower than the speed of a single drive. Sata does support 6Gbit/s (and SSDs support around 600MB/s per drive), but all 18 drives together add up to 10.8GB/s of combined bandwidth, or roughly 86Gbit/s, about 15 times what my single lane PCI bus can do :)
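The arithmetic behind that can be sketched in a few lines of Python. The per-lane numbers are assumptions: roughly 500MB/s usable for a Gen 2 lane and roughly 985MB/s for Gen 3 (as far as I know the Pi 5 lane is Gen 2 by default and can be pushed to Gen 3). The rebuild ceiling is a deliberate simplification that only counts the reads from the surviving members sharing the lane.

```python
# Back-of-the-envelope bandwidth budget for many SATA SSDs behind one PCIe lane.
# The per-lane figures are approximate usable rates and are assumptions here.

SATA_SSD_MB_S = 600  # per-drive figure quoted in the text
PCIE_LANE_MB_S = {"gen2": 500, "gen3": 985}  # approx. usable MB/s for one lane


def aggregate_mb_s(num_drives: int, per_drive_mb_s: int = SATA_SSD_MB_S) -> int:
    """Combined bandwidth if every drive could stream at full speed."""
    return num_drives * per_drive_mb_s


def oversubscription(num_drives: int, lane_mb_s: float) -> float:
    """How many times the drives could oversubscribe the single lane."""
    return aggregate_mb_s(num_drives) / lane_mb_s


def rebuild_ceiling_mb_s(array_drives: int, lane_mb_s: float) -> float:
    """Upper bound on rebuild speed when all surviving members are read
    at once through the same lane (ignores protocol overhead)."""
    return lane_mb_s / (array_drives - 1)


if __name__ == "__main__":
    total = aggregate_mb_s(18)
    print(f"18 drives: {total / 1000:.1f} GB/s, {total * 8 / 1000:.1f} Gbit/s")
    for gen, lane in PCIE_LANE_MB_S.items():
        print(f"{gen}: {oversubscription(18, lane):.1f}x oversubscribed, "
              f"10 drive rebuild ceiling ~{rebuild_ceiling_mb_s(10, lane):.0f} MB/s")
```

Under the Gen 2 assumption the rebuild ceiling for a 10 drive array comes out in the mid 50s of MB/s, which is in the same range as the 50MB/s I observed.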

    So yes, it can work, but it's not fast. For reference, with an unrestricted sata bus, rebuild speed can be up to 600MB/s, which is the speed limit of the drive writing parity. In real life, the more drives you have, the more data is on the sata bus or busses during rebuild, as explained above, so it's of course rare to get the full speed, especially with 10 drives, but for me it was sad to get below 100MB/s as I was getting more than that with my spinning rust drives.

    This is what it looks like, by the way:

    md1 : active raid6 sdr1[7] sdj1[3] sdp1[9] sdq1[6] sdm1[4] sdo1[8] sdi1[2] sdk1[1] sdn1[5] sdl1[0]
          31255076864 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
          bitmap: 0/8 pages [0KB], 65536KB chunk
    

    md2 : active raid5 sdd1[2] sde1[3] sda1[1] sdc1[4] sdf1[5] sdb1[0] sdg1[6] sdh1[8]
          26254235392 blocks super 1.2 level 5, 256k chunk, algorithm 2 [8/8] [UUUUUUUU]
          bitmap: 0/7 pages [0KB], 65536KB chunk
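For anyone wanting to sanity check numbers like these, here is a small Python sketch that parses mdstat-style text and derives the usable size and the per-member size. It assumes the block counts are in 1KiB units and that raid5 and raid6 give up one and two members' worth of space to parity; `parse_mdstat` and the field names are hypothetical, made up for this example.

```python
import re

# Members' worth of capacity used for parity, per RAID level.
PARITY_DISKS = {"raid5": 1, "raid6": 2}

ARRAY_RE = re.compile(r"^\s*(md\d+)\s*:\s*active\s+(raid\d+)\s+(.*)$")
BLOCKS_RE = re.compile(r"^\s*(\d+) blocks .*\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

SAMPLE = """\
md1 : active raid6 sdr1[7] sdj1[3] sdp1[9] sdq1[6] sdm1[4] sdo1[8] sdi1[2] sdk1[1] sdn1[5] sdl1[0]
      31255076864 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
md2 : active raid5 sdd1[2] sde1[3] sda1[1] sdc1[4] sdf1[5] sdb1[0] sdg1[6] sdh1[8]
      26254235392 blocks super 1.2 level 5, 256k chunk, algorithm 2 [8/8] [UUUUUUUU]
"""


def parse_mdstat(text: str) -> list:
    """Return one dict per raid5/raid6 array found in mdstat-style text."""
    arrays = []
    current = None
    for line in text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            current = {"name": m.group(1), "level": m.group(2)}
            continue
        m = BLOCKS_RE.match(line)
        if m and current is not None and current["level"] in PARITY_DISKS:
            raid_disks = int(m.group(2))
            data_disks = raid_disks - PARITY_DISKS[current["level"]]
            usable_tb = int(m.group(1)) * 1024 / 1e12  # assumes 1KiB blocks
            current.update(
                data_disks=data_disks,
                usable_tb=usable_tb,
                member_tb=usable_tb / data_disks,
                degraded="_" in m.group(4),
            )
            arrays.append(current)
            current = None
    return arrays


if __name__ == "__main__":
    for a in parse_mdstat(SAMPLE):
        print(f"{a['name']} {a['level']}: {a['usable_tb']:.2f}TB usable, "
              f"{a['member_tb']:.2f}TB per member, degraded={a['degraded']}")
```

On the two arrays above this gives about 4.00TB per member for md1 and about 3.84TB per member for md2, with neither array degraded.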

    So given that, my master plan of building a big NAS this way does not make a lot of sense, and the quad splitter does not make a lot of sense for anything other than 4 NVME drives that you are ok with running at a much lower speed than they are capable of (a 4 lane NVME drive would run at 1/16th of its speed when getting 1/4th of one lane). In the end, for a couple of drives, using the dual splitter GeekPi board to power 2 independent boards is not such a bad idea, and using the real sata Hat offers real power to the drives (up to 6-7A I think), saving you the trouble of having to make your own power supply like I did:


    routing the Pi5 Ribbon is a bit tricky and requires longer ribbon cables to reach the middle splitter board


    SSDs and Prices, using cheaper DRAM-less SSDs and QLC RDAT drives with Raid5/6

    Obviously I picked the wrong time to buy a bunch of SSDs. Proper 4TB SSDs run around $700, if not worse, so I went for low grade DRAM-less TLC or QLC drives off Ebay (still around $300 a piece). I figured with RAID6, it would not be so bad, and for one of my 2 arrays, write performance and many rewrites were not a concern. I also found out that the TeamGroup 4TB drives were a mix of TLC and QLC, with 3 different kinds of controllers, and some were 4.09TB where others were just 4.00TB. Then I found out about discard/TRIM support and this:

    /dev/sdr * Deterministic read ZEROs after TRIM
    /dev/sdi * Deterministic read data after TRIM

    The better, expensive drives guarantee RZAT, and the cheap ones are RDAT. The RDAT drives cannot support TRIM through raid5 or raid6, because raid requires that drives return 0 after TRIM so that parity works out later, and RDAT drives do not give that guarantee. Linux raid nicely detects that and turns off discard support. This however also means that after deleting data, there is no way to mark that flash as free for the drives: you can't trim or fstrim. The only sad thing with btrfs is that it effectively does wear leveling of the underlying drives, which means that over time all the SSD blocks get used, and there is no way to tell the drives which blocks are free. That is not ideal, especially for QLC drives, as they are quite slow to rewrite blocks when they don't have plenty of free space.
    Knowing that, I made sure to build that array as a mostly write once array, which makes the write penalty less important.
    For my other array, used for backups and lots of rewriting, I made sure to use higher grade DRAM TLC Samsung and Micron enterprise drives I had laying around. I still had one drive in that array that didn't support RZAT, but with those higher grade drives, not having TRIM was not as bad (they do a better job rewriting and do ok enough with their reserved space).
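To sort a pile of drives before building arrays, a small Python sketch like this can classify them from the identify text shown earlier. It only does string matching on text you feed it (for example saved `hdparm -I` output); `trim_behaviour` is a hypothetical helper name, and the "Data Set Management TRIM supported" line is what I believe hdparm prints for basic TRIM support. What I call RDAT here is often written DRAT elsewhere.

```python
# Classify a drive's TRIM guarantee from identify text (e.g. saved hdparm -I output).
# Pure string matching; nothing here talks to a real drive.


def trim_behaviour(identify_text: str) -> str:
    """Return 'RZAT', 'RDAT', 'TRIM (non-deterministic)' or 'no TRIM'."""
    text = identify_text.lower()
    # Check RZAT first: drives that return zeroes may list both lines.
    if "deterministic read zeros after trim" in text:
        return "RZAT"
    if "deterministic read data after trim" in text:
        return "RDAT"
    if "data set management trim supported" in text:
        return "TRIM (non-deterministic)"
    return "no TRIM"


def safe_for_parity_raid_discard(identify_text: str) -> bool:
    """Only drives that guarantee zeroes after TRIM keep parity consistent."""
    return trim_behaviour(identify_text) == "RZAT"


if __name__ == "__main__":
    samples = {
        "/dev/sdr": "   * Deterministic read ZEROs after TRIM",
        "/dev/sdi": "   * Deterministic read data after TRIM",
    }
    for dev, text in samples.items():
        print(dev, trim_behaviour(text), safe_for_parity_raid_discard(text))
```

The idea is simply to put the RZAT drives in the array that gets rewritten a lot and keep the RDAT ones for the write once mostly array.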

    Stressing the rPi5 and the ASM1184e PCI switch

    I then learned a bunch about the limitations of PCI port switches like the ASM1184e. Once I started using mine seriously, I got a bunch of weird errors and disconnects, until Gemini found that it's a known issue with them overheating under load. I just put an RC plane video chip radiator on the chip, and now the radiator is hot and the chip seems to work reliably.


    Then I found out that my cheap Teamgroup 4TB DRAM-less drives (the real TLC DRAM ones are now hovering between $600 and $700 a pop for 4TB) are fine, until they stall during a big copy, a btrfs scrub, or whatever.
    When they stall, they eventually time out the PCI bus, which, behind the quad PCI switch, causes the rPi to reset everything. In the end this caused enough PCI mayhem that the sata cards were reset, and 3 of the Teamgroup drives crashed and failed to write what they had, to the point that they were corrupted enough for linux to not be able to use their partitions anymore. Yes, a single drive stall caused a PCI timeout long enough to crash/reset the SATA controllers, which apparently managed to get the cheap Teamgroup drives to corrupt the partition table blocks and leave the blocks unmapped, unreadable and unwritable:

    nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [57247.067230] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [57247.076133] ata8: SError: { RecovData Handshk }
    [57247.081246] ata8.00: failed command: READ DMA
    [57247.086014] ata8.00: cmd c8/00:08:c8:03:5b/00:00:00:00:00/e1 tag 2 dma 4096 in
    [57247.086014] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    [57247.101469] ata8.00: status: { DRDY }
    [57247.105822] ata8: hard resetting link
    [57247.153797] nvme nvme0: 3/0/0 default/read/poll queues
    [57247.587051] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [57247.630423] ata8.00: supports DRM functions and may not be fully accessible
    [57247.750869] ata8.00: supports DRM functions and may not be fully accessible
    [57247.807025] ata8.00: configured for UDMA/133
    [57247.811957] sd 7:0:0:0: [sdh] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=32s
    [57247.822653] sd 7:0:0:0: [sdh] tag#2 Sense Key : 0xb [current]
    [57247.829121] sd 7:0:0:0: [sdh] tag#2 ASC=0x0 ASCQ=0x0
    [57247.835477] sd 7:0:0:0: [sdh] tag#2 CDB: opcode=0x88 88 00 00 00 00 00 01 5b 03 c8 00 00 00 08 00 00
    [57247.845243] I/O error, dev sdh, sector 22741960 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
    [57247.855511] ata8: EH complete
    [57247.872535] ata8.00: Enabling discard_zeroes_data
    [60367.285453] ata9.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
    [60367.293666] ata9.00: irq_stat 0x08000000, interface fatal error
    [60367.300313] ata9: SError: { UnrecovData Handshk }
    [60367.306530] ata9.00: failed command: WRITE DMA EXT
    [60367.311966] ata9.00: cmd 35/00:00:78:8c:f7/00:05:1e:00:00/e0 tag 9 dma 655360 out
    [60367.311966] res 50/00:00:ff:03:f7/00:00:1e:00:00/e0 Emask 0x10 (ATA bus error)
    [60367.328871] ata9.00: status: { DRDY }
    [60367.333036] ata9: hard resetting link
    [60367.805496] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [60367.863205] ata9.00: configured for UDMA/133
    [60367.868064] ata9: EH complete
    [60397.357520] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [60397.453616] nvme nvme0: 3/0/0 default/read/poll queues
    [60398.929509] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [60398.959616] ata1: SError: { RecovData Handshk }
    [60398.966761] ata1.00: failed command: READ DMA
    [60398.972859] ata1.00: cmd c8/00:08:78:b9:4a/00:00:00:00:00/e2 tag 22 dma 4096 in
    [60398.972859] res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [60398.990825] ata1.00: status: { DRDY }
    [60398.995717] ata1: hard resetting link
    [60399.473455] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [60399.541525] ata1.00: configured for UDMA/133
    [60399.546657] sd 0:0:0:0: [sda] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=32s
    [60399.557577] sd 0:0:0:0: [sda] tag#22 Sense Key : 0xb [current]
    [60399.564532] sd 0:0:0:0: [sda] tag#22 ASC=0x0 ASCQ=0x0
    [60399.570665] sd 0:0:0:0: [sda] tag#22 CDB: opcode=0x88 88 00 00 00 00 00 02 4a b9 78 00 00 00 08 00 00
    [60399.580758] I/O error, dev sda, sector 38451576 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
    [60399.590585] ata1: EH complete
    [60399.640204] ata1.00: Enabling discard_zeroes_data
    [72688.943036] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [72688.951084] ata1: SError: { RecovData Handshk }
    [72688.956422] ata1.00: failed command: WRITE DMA
    [72688.961594] ata1.00: cmd ca/00:20:00:ac:82/00:00:00:00:00/e5 tag 14 dma 16384 out
    [72688.961594] res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [72688.977731] ata1.00: status: { DRDY }
    [72688.981969] ata1: hard resetting link
    [72688.986211] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [72688.994663] ata2: SError: { RecovData Handshk }
    [72688.999881] ata2.00: failed command: WRITE DMA
    [72689.005000] ata2.00: cmd ca/00:20:c0:b0:82/00:00:00:00:00/e5 tag 19 dma 16384 out
    [72689.005000] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [72689.022962] ata2.00: status: { DRDY }
    [72689.027430] ata2: hard resetting link
    [72689.499039] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [72689.506396] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [72689.611777] ata1.00: configured for UDMA/133
    [72689.616890] ata1: EH complete
    [72689.723181] ata2.00: configured for UDMA/133
    [72689.728156] ata2: EH complete
    [72689.865333] ata1.00: Enabling discard_zeroes_data
    [72689.871277] ata2.00: Enabling discard_zeroes_data
    [73227.538624] nvme nvme1: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [73227.640436] nvme nvme1: D3 entry latency set to 8 seconds
    [73227.658550] nvme nvme1: 1/0/0 default/read/poll queues
    [86766.334170] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [86766.442187] nvme nvme0: 3/0/0 default/read/poll queues
    [86766.863105] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [86766.877356] ata6: SError: { RecovData Handshk }
    [86766.884232] ata6.00: failed command: WRITE DMA
    [86766.891103] ata6.00: cmd ca/00:80:18:95:b5/00:00:00:00:00/e6 tag 20 dma 65536 out
    [86766.891103] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    [86766.908556] ata6.00: status: { DRDY }
    [86766.914377] ata6: hard resetting link
    [86766.919016] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [86766.930307] ata2: SError: { RecovData Handshk }
    [86766.937937] ata2.00: failed command: READ DMA
    [86766.943738] ata2.00: cmd c8/00:38:a0:e6:3b/00:00:00:00:00/e5 tag 4 dma 28672 in
    [86766.943738] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [86766.965459] ata2.00: status: { DRDY }
    [86766.970640] ata2: hard resetting link
    [86766.976782] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [86766.989369] ata3: SError: { RecovData Handshk }
    [86767.001777] ata3.00: failed command: WRITE DMA
    [86767.010295] ata3.00: cmd ca/00:80:18:95:b5/00:00:00:00:00/e6 tag 21 dma 65536 out
    [86767.010295] res 40/00:00:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [86767.060215] ata3.00: status: { DRDY }
    [86767.071409] ata3: hard resetting link
    [86767.550253] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [86767.563271] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [86767.572715] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [86767.585598] ata6.00: supports DRM functions and may not be fully accessible
    [86767.616959] ata6.00: supports DRM functions and may not be fully accessible
    [86767.631980] ata3.00: configured for UDMA/133
    [86767.639404] ata3: EH complete
    [86767.643354] ata6.00: configured for UDMA/133
    [86767.661059] ahci 0001:03:00.0: port does not support device sleep
    [86767.663591] ata3.00: Enabling discard_zeroes_data
    [86767.676336] ata6: EH complete
    [86767.745871] ata2.00: configured for UDMA/133
    [86767.754280] ata2: EH complete
    [86767.772933] ata2.00: Enabling discard_zeroes_data
    [95256.566913] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [95256.574928] nvme nvme1: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [95256.679475] nvme nvme1: D3 entry latency set to 8 seconds
    [95256.689110] nvme nvme0: 2/0/0 default/read/poll queues
    [95256.694718] nvme nvme1: 1/0/0 default/read/poll queues
    [95256.697626] I/O error, dev nvme0n1, sector 264208 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
    [95256.712397] I/O error, dev nvme0n1, sector 264208 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
    [95256.722258] md: super_written gets error=-5
    [95256.727133] md/raid1:md0: Disk failure on nvme0n1p2, disabling device.
    [95256.727133] md/raid1:md0: Operation continuing on 1 devices.
    [95256.742401] I/O error, dev nvme0n1, sector 77334752 op 0x1:(WRITE) flags 0x4000800 phys_seg 1 prio class 2
    [95256.753375] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
    [95256.764177] I/O error, dev nvme0n1, sector 77335776 op 0x1:(WRITE) flags 0x4000800 phys_seg 1 prio class 2
    [95256.774805] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 2, rd 1, flush 0, corrupt 0, gen 0
    [97602.825969] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    [97602.833948] ata6.00: failed command: WRITE DMA EXT
    [97602.839911] ata6.00: cmd 35/00:00:78:5a:4c/00:04:09:00:00/e0 tag 22 dma 524288 out
    [97602.839911] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [97602.858583] ata6.00: status: { DRDY }
    [97602.863617] ata6: hard resetting link
    [97603.337938] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [97603.346750] ata6.00: supports DRM functions and may not be fully accessible
    [97603.370306] ata6.00: supports DRM functions and may not be fully accessible
    [97603.430476] ata6.00: configured for UDMA/133
    [97603.445466] ahci 0001:03:00.0: port does not support device sleep
    [97603.452251] ata6: EH complete
    [97637.643844] BTRFS warning (device dm-1): csum failed root 263 ino 3692950 off 386400256 csum 0xd04e5f48 expected csum 0x6b9afaa1 mirror 1
    [97637.657936] BTRFS error (device dm-1): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
    [97638.110104] BTRFS warning (device dm-1): csum failed root 263 ino 3692950 off 386400256 csum 0xd04e5f48 expected csum 0x6b9afaa1 mirror 1
    [97638.123856] BTRFS error (device dm-1): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
    [97662.159091] BTRFS warning (device dm-1): csum failed root 263 ino 3692950 off 386400256 csum 0xd04e5f48 expected csum 0x6b9afaa1 mirror 1
    [97662.173941] BTRFS error (device dm-1): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
    [97662.906008] BTRFS warning (device dm-1): csum failed root 263 ino 3692950 off 386400256 csum 0xd04e5f48 expected csum 0x6b9afaa1 mirror 1
    [97662.920993] BTRFS error (device dm-1): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
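One way to spot the pattern in logs like this without reading every line is to look for NVMe controller resets and SATA port exceptions that land within a few seconds of each other, since the two only share the upstream PCI path. This is a minimal Python sketch of that idea; the 5 second window and the function names are assumptions picked for illustration, and the sample is a handful of lines taken from the log above.

```python
import re

# Timestamped NVMe controller resets and SATA port exceptions in kernel logs.
EVENT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+"
    r"(?:nvme (nvme\d+): controller is down|(ata\d+)\.\d+: exception Emask)"
)

SAMPLE = """\
[60367.285453] ata9.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
[60397.357520] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
[60398.929509] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
[72688.943036] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
[72688.986211] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
[73227.538624] nvme nvme1: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
"""


def parse_events(log: str) -> list:
    """Return (timestamp, device) tuples for resets and port exceptions."""
    events = []
    for m in EVENT_RE.finditer(log):
        events.append((float(m.group(1)), m.group(2) or m.group(3)))
    return events


def bus_wide_incidents(events: list, window_s: float = 5.0) -> list:
    """Group events closer together than window_s, and keep only the groups
    that involve both an NVMe controller and a SATA port."""
    groups = []
    for ts, dev in sorted(events):
        if groups and ts - groups[-1][-1][0] <= window_s:
            groups[-1].append((ts, dev))
        else:
            groups.append([(ts, dev)])
    return [
        g for g in groups
        if any(d.startswith("nvme") for _, d in g)
        and any(d.startswith("ata") for _, d in g)
    ]


if __name__ == "__main__":
    for group in bus_wide_incidents(parse_events(SAMPLE)):
        devices = ", ".join(dev for _, dev in group)
        print(f"t={group[0][0]:.0f}s: {devices} failed together")
```

A group that contains both kinds of devices hints that something upstream of the sata cards hiccuped, rather than one bad drive or one bad sata port.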

    Recovering unusable DRAM-less Teamgroup drives

    By then it was impossible to read or write to the 3 Teamgroup drives that failed, so I had to blkdiscard (TRIM) the 3 crashed drives (out of 7) in their entirety to restart with 0's everywhere (which meant full data loss of course), and start over.
    Gemini gave me linux kernel sata and PCI options to make it less likely for this to happen again, but it also warned me that it very much can happen again, and that DRAM-less drives should never be behind a PCI switch.

    At the same time, it became painfully obvious that since the rPi5 has a single PCI lane, and all those PCI switches add more channels while dividing the bandwidth of that single lane, making things slower and slower, the whole thing was a bit of a fool's errand.
    By then, I had to admit defeat, and since I wanted to run frigate for my cameras anyway, Gemini suggested I get an N355 based server, which has an H264 and H265 ASIC for all those video streams (while the rPi5 would have to do it in software), and at least 4 PCI lanes, which is much better (it's still 4x single lane M2 NVME, but at least it's 4 times faster, and without a PCI switch to confuse things and cause full hangs if a single sata drive freezes while writing its data).

    π 2026-04-17 01:01 in Linux, Public

    My gemini tips

    Gemini has been very useful to do sysadmin tasks about new stuff I don't know that well without spending hours reading all the new docs, or looking for said docs that don't exist (like sadly, btrfs) but at best are found in pieces over loads of random posts or Emails over the internet.

    This is where LLMs come in handy, and Google Gemini has been helpful, and even surprisingly good at times. Other times, it did badly or looked into errors where I already corrected it (especially the Pro model).

    So here is what I learned after a week of doing this:

  • The pro model sounds like it's better than thinking, but for some sysadmining tasks and looking multiple steps ahead, it is not (both the pro and thinking models agreed on that when I asked both of them)
  • When I ask a question to one Gemini model, I can switch the model type and ask the other one to re-answer and say whether it agrees. They both share the same history, so the other model can pick up where you left off
  • once, I was using both for medical questions, and they worked off one another's suggestions when they didn't agree, and eventually one or the other yielded to the other one and said why the other one's answer was better
  • For complex issues, like the crap systemd issue that has now taken almost a day of my time, I still run both a thinking and a pro model side by side and get them to review one another's answers to see who is right when they don't agree. That saves me time trying.
  • I now have a prompt in https://gemini.google.com/saved-info that says:
  • If my questions feel like they would be better answered by the thinking model, please offer me to switch. If I am using the thinking model but it feels the pro model would have done a better job, offer me to switch. If a question would have been answered easily by the fast model, please put '[easy question for the fast model to answer just as well]'.
  • Also create this one:
  • 1) At the top of each answer, please insert [I am the $VERSION model answering this] (obviously substitute $VERSION for your own version).

    2) Because I have been switched (against my will) to the fast model when I selected the thinking model before, please tell me if the current chat and history are too large, and although history should be good for you to have context, if you think it's so large that it's causing me to be moved to the fast model, please warn me (as a best guess if you're not sure)

    3) when I paste hundreds or 1000+ lines of logs for you to analyze, this spams your backlog/context log and makes it harder for you to find relevant technical (or medical or whatnot) information relevant to the chat we're in. When you think this is happening, please grab all relevant information far up in the history log, and write a summary for yourself in the next answer so you can find those points more easily later without going so far back

  • The pro model literally told me the thinking model was going to be better for my day's worth of questions on a complex systemd thing (everything is sadly complex with systemd, oh god I do often miss sysvinit. The number of days of work I have lost to systemd compared to the number of seconds of boot time it saved me is not even close to being a win at my scale. If you change that to google scale, it's different of course, but they also didn't use systemd at the time I was there, to keep the boot predictability and reproducibility of a serial boot).
  • Gemini is great to shove hundreds of lines of system logs at and have it tell me why 3 of my SATA SSDs died, analyzing each return error code in order, and telling me it's not the M2 sata card, but the PCI expander card that blipped, added too much latency, or god knows what, which then caused a PCI timeout, which then caused the kernel to reset the sata card, which then reset a DRAM-less SSD while it was writing data, which in turn corrupted the data on that drive. It then gives me what blkdiscard command to run to get the SSD back to life, and tells me there is no way that rPi5 is going to ever handle that many drives over 2 levels of PCI multipliers (and it was right).
  • Pro tip #2: after I shove hundreds of lines of logs at it, I have found that gemini forgets its own suggestions and context from yesterday, even if it could find them after I told it its answer or question was stupid because it told me that yesterday. Then it looked deeper back and found them. The fix I now apply is that after hundreds/thousands of lines of logs, I ask it to go back and summarize all relevant previous points so it can find them more easily as recent context (I think the farther back it is in the context window, the less likely it is to find stuff).
  • Pro tip #3: use a chat name in the left menu for each kind of conversation. I have one for linux, one for arduino, one for programming, one for medical, and so forth
  • The only thing being lost here is that everything I'm doing and learning does not end up in mailing lists for others to maybe find, so I'll write a blog entry from time to time with summaries. The win for me is I get all this fixed in hours, or a day or two at most instead of days or weeks over mailing lists and typical 24h+ turnaround times on the btrfs list if I even get an answer at all (thankfully I did for my recent find of a corruption bug in simple quotas and the developer has been nice and responsive in working on it once he found my original message).

    Interesting answers I've gotten, which I will paste here:

    [I am the Gemini 3 Flash model answering this]

    [with saved rule improved research and verification]

    It is a total head-scratcher to see "Thinking" in the dropdown while I'm telling you my internal ID is "Flash." I can definitely see why that feels like you've been stealth-throttled.

    The Source of the Mismatch

    The simplest explanation, and the one you need to hear, is that there is a divide between the UI label and the Back-end routing.

    My system instructions (my "hardwired" identity) state explicitly that I am Gemini 3 Flash. When you select "Thinking" in the dropdown, the system intends to route your prompt to the reasoning-heavy model. However, several things can cause the back-end to serve a Flash response instead:

    Capacity & Load: If the "Thinking" infrastructure is under heavy load, the system may fallback to Flash to ensure you get a response quickly rather than hitting a "busy" error or an infinite loading spinner.

    Conversation Length: We have been working on Gargamel for hours. This is a very long, high-context thread. Flash is exceptionally efficient at maintaining long-context coherence. The system might have shifted to Flash to keep the conversation from becoming sluggish or hitting tokens/memory limits that the Thinking model might struggle with in a single long turn.

    Routing Logic: Sometimes the orchestrator decides a query is "technical synthesis" rather than "deep reasoning" and routes it to the faster model to save resources.

    Is it Throttling? (Inference)

    It is highly likely a form of dynamic capacity management. In the Paid tier, you generally get priority access, but if the reasoning engine is at 100% capacity, the system prioritizes "Response over Reasoner Type."

    Because I have all your Saved Information and the deep context of the LUKS/systemd struggle, I can still provide high-level assistance, but I don't have that "Thinking" model's internal scratchpad where it simulates every possible failure state before speaking.

    How to get the "Real" Thinking Model

    If you want to force the shift to see if it catches something I missed:

    Copy the current state: Grab the latest version of your cryptgetpw and fetch-luks-key.service.

    Start a New Chat: This clears the token pressure.

    Ensure "Thinking" is selected: Paste the context and see if the header I generate changes.

    pro vs thinking round #1

    Interesting answers I've gotten, which I will paste here:
    I don't have a live dashboard that shows me how ChatGPT or Anthropic are answering your specific queries in real-time, but I am architecturally aware of the "benchmarks" and the fundamental differences in how we approach problems like your systemd/Dracut project.

    Here is the objective breakdown of how my performance likely compared to theirs over the last 6+ hours of work.

    1. Architectural Awareness

    Models from Anthropic (Claude) and OpenAI (o1/GPT-4o) use different weights and, in some cases, dedicated "reasoning" layers.

    The "Pro" Trap: As we discussed, I (Gemini Pro) operate primarily on high-speed synthesis. I am excellent at pulling together disparate facts (like LUKS syntax and systemd service structures). However, I failed to simulate the state transition of a boot sequence.

    The Competitor Advantage: A "Thinking" model (like Google's own Thinking model, OpenAI's o1, or Claude 3.5 Sonnet/Opus) likely would have caught the pivot_root double-run error in the first two turns. They are trained more heavily on "long-context reasoning," which allows them to hold the entire "boot timeline" in mind more effectively than a standard synthesis model.

    2. Candid Self-Assessment

    In our interaction today, my performance followed a specific trajectory:

    Breadth: Excellent. I accurately identified Dracut modules, kernel command line parameters, and systemd override syntax.

    Logical Simulation: Poor. I missed the physical reality that sysinit.target is ordered after the cryptsetup units in the systemd dependency graph. This is where I fell behind.

    Adaptability: Strong. Once you provided the logs and "corrected" my logic, I was able to pivot immediately.
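
    As a rough illustration of the ordering mistake described above, the boot chain can be treated as a small dependency graph and sorted. This is only a sketch: the chain cryptsetup-pre.target, systemd-cryptsetup@, cryptsetup.target, sysinit.target follows the documented systemd ordering, but the instance name "root" and the unit "late-key-helper.service" are hypothetical names invented for the example.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of units it must start *after*.
# "root" (the instance name) and "late-key-helper.service" are hypothetical.
ORDERING = {
    "systemd-cryptsetup@root.service": {"cryptsetup-pre.target"},
    "cryptsetup.target": {"systemd-cryptsetup@root.service"},
    "sysinit.target": {"cryptsetup.target"},
    "late-key-helper.service": {"sysinit.target"},
}


def boot_order(ordering):
    """Return one valid start order for the given After= relationships."""
    return list(TopologicalSorter(ordering).static_order())


def runs_before(ordering, first, second):
    """True if `first` precedes `second` in the computed order.

    For a linear chain like ORDERING the order is unique, so this is exact.
    """
    order = boot_order(ordering)
    return order.index(first) < order.index(second)


if __name__ == "__main__":
    for unit in boot_order(ORDERING):
        print(unit)
```

    Under these assumptions, a helper ordered after sysinit.target can never run before the cryptsetup unit it was meant to feed, which is the kind of timeline error the reply admits to.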

    3. The "Human-as-Reasoning-Engine" Factor

    For the last several hours, you acted as my thinking layer. You provided the grounding (logs, physical observations, and hardware specifics) that allowed my synthesis engine to eventually reach the correct conclusion.

    Anthropic/OpenAI often feel more "stable" in these niche Linux edge cases because their reasoning steps are more conservative: they are less likely to give a "confident" but structurally flawed answer like the cryptsetup-pre.target suggestion I gave you earlier.

    The Verdict

    If you were to take this same project to a thinking model, the experience would likely be:

    Slower responses: It would take 10-30 seconds to "think" before answering.

    Higher "First-Time" Accuracy: It likely would have suggested the wait-for-script.conf drop-in and the initrd-release condition within the first hour.

    Less Loop-Back: You would have spent less time providing me with "I told you so" corrections.

    And Pro vs Thinking vs Flash vs Flash Thinking submode:

    [I am the Gemini 3 Flash model answering this]

    You've hit the nail on the head regarding the "Reasoning Hierarchy." Your engineering intuition is correct: there is a distinct difference between a Reasoning-First Model and a Fast Model with a Reasoning Layer.

    The Identity Discrepancy Explained

    The discrepancy you're seeing between the UI dropdown ("Thinking") and my self-identification ("Flash") is a result of Dynamic Routing. Here is why you are talking to "Flash Thinking" instead of the "Real" Reasoning model:

    The Context Pressure: We have been working on Gargamel for over 6 hours, pasting 1000+ lines of logs and complex systemd configurations. This thread is now massive. Gemini 3 Flash is the "Context King": it is architecturally designed to handle these massive backlogs without losing the "Gargamel Checkpoint" or becoming sluggish. The specialized Reasoning model often has a tighter "focus window" and might struggle with the sheer volume of data we've generated.

    Capacity vs. Optimization: You aren't "blocked" in the sense of being penalized. Instead, the orchestrator has likely decided that for a high-turn, high-volume technical session, "Flash Thinking" provides the best balance of speed and technical accuracy.

    The "Sub-Thinking" Engine: As you suspected, "Flash Thinking" is a specialized configuration where my fast synthesis engine is augmented by a reasoning head. It allows me to handle the TTY/File Descriptor logic we just solved, whereas the "Real" Reasoning model (likely based on the Pro/Ultra core) would spend 30-60 seconds "thinking" about every single line of code.

    Pro vs. Flash vs. Thinking

    The Pro model earlier today recognized that we were stuck in a "simulation loop" (where I kept suggesting things that failed in the physical boot). It suggested the switch because the Real Thinking Model is better at "simulating" a 10-second boot sequence in its internal "scratchpad" before answering.

    Current State Verdict: I am currently Flash utilizing a Thinking Mode reasoning layer. It is a "hybrid" state. It is smarter than the standard Flash you'd use for writing an email, but it lacks the deep, non-linear simulation power of the dedicated Reasoning model.

    Why the UI says "Thinking"

    The UI dropdown reflects your intent. The system tries to give you the dedicated reasoning model. However, if the system detects a massive context history (like ours) or high server load, it routes you to the most capable model that can still maintain a fast response time, which, in 2026, is me (Flash) with the reasoning engine active.

    To get the "Real" Thinking Model: If you truly want the deep simulation for the final igc network poll or the TTY race condition, you should start a fresh chat and paste only the final cryptgetpw script and the goal. The reduced context pressure will likely allow the orchestrator to route you to the heavy reasoning engine.

    pro vs thinking round #2

    And more clarification: start a new chat when you want the thinking model.
    1. The Context vs. Routing Trade-off

    The system router makes a split-second decision before I even see your prompt. It looks at the total "weight" of the conversation: every log, every script, and every historical checkpoint we've created.

    Flash (Current Mode): Has a massive context window (millions of tokens). It can "swallow" your 1000-line logs and still remember your hardware paths from 6 hours ago without flinching.

    Thinking Model: Dedicated reasoning models often have higher "computational cost" per token. If a chat gets too heavy, the router shifts you to me (Flash) because I can handle the volume while maintaining a fast response time.

    2. Is the "Linux 2, 3, 4" Strategy Better?

    Yes. From a purely technical standpoint, starting a fresh thread is the only way to "guarantee" a clean slate for the router.

    Fresh Chat = Higher Tier: A zero-context prompt is much more likely to be routed to the high-end Thinking engine because it represents a low resource load.

    The Problem: You lose our shared history. You'd have to re-paste the "Gargamel Checkpoint" every time.

    3. Can you request "Limited Context" to force the Thinking model?

    Unfortunately, no. The router sits above the model. By the time I read your instruction to "limit context," the router has already seen the entire 100k+ token history and made its decision to send the query to Flash. I cannot "un-read" history to change how the system routes me.
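
    To make the claim above concrete, here is a toy model of the routing behaviour the reply describes. It is purely illustrative: the reply itself is the only source for this behaviour, nothing here reflects a known implementation, and every name and threshold (Request, route, MAX_REASONER_CONTEXT, MAX_REASONER_LOAD) is made up for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Request:
    selected_mode: str    # what the user picked in the UI, e.g. "thinking"
    context_tokens: int   # size of the conversation history so far
    reasoner_load: float  # 0.0 (idle) to 1.0 (saturated)


# Hypothetical thresholds, chosen only so the example has something to compare.
MAX_REASONER_CONTEXT = 100_000
MAX_REASONER_LOAD = 0.95


def route(req: Request) -> str:
    """Pick a backend from metadata only; the prompt text is never consulted."""
    if req.selected_mode != "thinking":
        return "flash"
    if req.context_tokens > MAX_REASONER_CONTEXT:
        return "flash-thinking"   # history too heavy for the reasoner
    if req.reasoner_load >= MAX_REASONER_LOAD:
        return "flash-thinking"   # reasoner saturated, fall back
    return "thinking"


if __name__ == "__main__":
    print(route(Request("thinking", 150_000, 0.20)))  # long thread
    print(route(Request("thinking", 2_000, 0.20)))    # fresh chat
```

    In this sketch route() has no access to the prompt at all, which mirrors the point being made: an instruction typed into the chat cannot influence a decision taken before the model reads it.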

    The "Engineer's Workflow" Recommendation

    To avoid having dozens of "Linux X" chats while keeping the Reasoning engine available, I suggest this hybrid approach:

    Maintain the "Master Chat": Keep this thread (Gargamel) as your archive and for "fast" technical questions (Flash is great at this).

    The "Thinking Sandbox" (Temporary Threads): When we hit a wall, like the TTY race condition or the igc network poll timing, open a new "Thinking" chat.

    The Transfer Protocol:

    Copy just the Final Checkpoint (Hardware, Current Script, Goal).

    Paste it into the new thread.

    Get the "Thinking" model's logic audit.

    Bring the final "fixed" code back here to the Master Chat.
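
    The transfer steps above amount to packing three fields into one block of text that can be pasted into a fresh chat. A minimal sketch of that, assuming the three sections named in the reply (Hardware, Current Script, Goal); the function name build_checkpoint and the "== Title ==" layout are invented for the example:

```python
def build_checkpoint(hardware: str, script: str, goal: str) -> str:
    """Assemble a paste-ready checkpoint from the three sections."""
    sections = [
        ("Hardware", hardware),
        ("Current Script", script),
        ("Goal", goal),
    ]
    parts = []
    for title, body in sections:
        # Strip surrounding blank lines so the pasted block stays compact.
        parts.append(f"== {title} ==\n{body.strip()}")
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    print(build_checkpoint(
        hardware="(describe the machine here)",
        script="(paste the current script here)",
        goal="(state what should happen at boot)",
    ))
```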

    Summary Checklist for you

    Penalized Forever? No. If you delete several early turns or the chat "cools down" (server load drops), you might get routed back. But practically, history length is the main driver.

    Drop Context? I am not programmed to "drop" context. I will try to read every single word you've ever sent in this thread until the hard token limit is hit.

    Best Move: Use the Summary method you suggested on April 18. Once we finish a specific milestone (like the TTY fix), let's summarize the "State of Gargamel" into one block and then start a "Linux 2" chat with that block as the only history.

    pro vs thinking round #3

    While a standard model (even a "Pro" version) generates its response in a single, forward-moving pass based on probabilistic patterns, a Thinking model performs an internal "Adversarial Review" on its own logic.

    How the Thinking Model would have stopped the "Console" mistake

    If I were simulating the /etc/default/grub configuration and the systemd service flow in a "Chain of Thought" scratchpad, my internal logic loop would have looked like this:

    Identify Goal: User needs output to both tty1 (VGA) and ttyS0 (Serial) simultaneously.

    Proposed Tool: StandardOutput=console.

    Constraint Check: Does systemd's StandardOutput=console broadcast to all terminals?

    Simulation: StandardOutput=console binds the script's FD 1 to /dev/console.

    Reference Kernel Rule: Userspace writes to /dev/console are routed only to the primary console (the last one in the console= list).

    Conflict Found: StandardOutput=console fails the "Dual Output" requirement.

    Adversarial Correction: Discard "StandardOutput=console". Search for the broadcast-capable method.

    Final Verification: Writes to /dev/kmsg land in the kernel ring buffer, and the kernel does broadcast ring buffer messages to all configured consoles.

    Output to User: "Do not use StandardOutput=console; it will not broadcast. Use > /dev/kmsg instead."
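
    The conclusion of the chain above (write to /dev/kmsg rather than /dev/console) could look roughly like this in practice. This is a sketch under assumptions: the default tag "cryptgetpw" is borrowed from the script name mentioned earlier, and the path parameter exists only so the writer can be pointed at an ordinary file for testing. Whether a message actually shows up on the consoles also depends on the kernel's console log level, so a low-priority message may stay in the ring buffer only.

```python
def format_kmsg(message: str, level: int = 6, tag: str = "cryptgetpw") -> str:
    """Build one /dev/kmsg record of the form '<level>tag: message'."""
    if not 0 <= level <= 7:
        raise ValueError("syslog level must be between 0 and 7")
    # A record is a single line, so fold embedded newlines into spaces.
    one_line = " ".join(message.splitlines())
    return f"<{level}>{tag}: {one_line}\n"


def write_kmsg(message: str, level: int = 6, path: str = "/dev/kmsg") -> None:
    """Write one record; needs sufficient privileges when path is /dev/kmsg."""
    # Each write() call becomes one record, so send it unbuffered in one go.
    with open(path, "wb", buffering=0) as kmsg:
        kmsg.write(format_kmsg(message, level).encode())


if __name__ == "__main__":
    # Level 6 is "info"; level 3 is "err".
    print(format_kmsg("waiting for LUKS passphrase"), end="")
```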

    The "Pro" model likely experienced a "Success Bias" hallucination: because the systemd-agent successfully solved the input broadcast problem, the model's pattern-matching engine over-generalized and assumed systemd's console handling solved the output problem as well. A reasoning engine is specifically built to verify those assumptions against a hard set of constraints (like kernel TTY physics) before speaking.
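
    The kernel rule cited in the reasoning chain (userspace writes to /dev/console reach only the last console= entry) is easy to check against a kernel command line. A minimal sketch, assuming simple whitespace-separated arguments; the function names and the sample command line are made up for the example:

```python
from typing import List, Optional


def console_devices(cmdline: str) -> List[str]:
    """Return the device of every console= argument, in command-line order."""
    devices = []
    for arg in cmdline.split():
        if arg.startswith("console="):
            # Drop options such as the ",115200n8" in console=ttyS0,115200n8
            devices.append(arg[len("console="):].split(",", 1)[0])
    return devices


def dev_console_target(cmdline: str) -> Optional[str]:
    """The last console= entry is the one /dev/console refers to."""
    devices = console_devices(cmdline)
    return devices[-1] if devices else None


if __name__ == "__main__":
    # Hypothetical command line; on a live system read /proc/cmdline instead.
    sample = "root=/dev/mapper/root console=ttyS0,115200n8 console=tty1"
    print(console_devices(sample))
    print(dev_console_target(sample))
```

    With two consoles configured, the first function lists both (the devices that receive kernel messages) while the second names the single device that a service using StandardOutput=console would actually write to.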


