




π 2026-04-13 01:01 in Computers, Linux

Replacing a 16 year old Sandy Bridge Server running 12 Spinning Rust Drives with something more efficient

My old Intel Sandy Bridge server gargamel, built in 2010, initially with a dual core duo and later upgraded to a quad core with hyperthreading, was 16 years old. It was still working, but I had already replaced the drives multiple times, from 2TB to 4TB, 6TB, and eventually 12TB drives, as the previous drives were getting old and started failing (my first ridiculous NAS was 2TB, with 26 SCSI SCA drives in 3 enclosures, circa 2002).

I set up that last server with 10 SATA drives in 2 enclosures of 5 drives each. It's been running for over 15 years with just a few drive upgrades and replacements, and is now at 64TB of spinning rust. Turns out I didn't really need that much, but on the last drive upgrade I went directly from 6TB to 12TB.

The server still works fine, but it's ultimately still running a debian install from 1999 that's been upgraded all these years, including a 32/64bit dual userland without systemd. But fighting "progress" only goes so far, and my 2nd disk array with 10Y+ old 4TB drives was starting to have more drive failures. Also, I realized that 250W+ of power is a bit more than needed, so I decided to upgrade to an rPi5 with 16GB of RAM and see if I could make a decent linux server out of it.

Considering a rPi5 with 20 SSDs

Here is what I did:
  • An rPi5 supports PCI, but I got a bit over ambitious with it. I got a 4X M2 slot switch, and used it for 2 old leftover NVMEs (500GB each) that I boot from in raid0
  • I also bought 2 9-port M2 sata cards, which allow for 18 drives.
  • At first I was thinking about having a few SSDs and re-using my 12TB drives in an external enclosure I already have. I also found a USB-3 to sata adapter that I could run the disk enclosure with, using USB3 (5Gbit/s) instead of going through a PCI sata card.

    But in the end, I decided to go without any spinning drives at all and went for a bunch of ebay 4TB SSDs to fill up all 18 slots, yielding 56TB. It was never the plan to have that much, but it's a pain to upgrade the arrays later and it felt more efficient to just fill up the raids with more drives. So I now have

    /dev/mapper/dshelf1   30T  5.4T   24T  19% /mnt/btrfs_pool1 => 10x cheap TLC + QLC SSDs in raid6
    /dev/mapper/dshelf2   25T  128G   25T   1% /mnt/btrfs_pool2 => 8x more expensive MLC/TLC enterprise drives in raid5
    /dev/mapper/dshelf3  447G  6.1M  445G   1% /mnt/btrfs_pool3 => left over space from some QLC drives that are 4.09TB
    /dev/mapper/dshelf4  447G  6.1M  445G   1% /mnt/btrfs_pool4 => left over space from MLC drives

    The next problem is "how do you power 18 directly connected external drives?". You're going to tell me to just get drive enclosures, but it turns out there are few if any external drive enclosures for 2.5" drives that offer a direct sata connection as well as their own power. You would think it shouldn't be too hard to buy reasonably sized standalone 12V/5V power supplies for sata drives that offer more than 20A of 5V (even NVME drives can take more than 1A each), but I didn't find any short of buying a full bore ATX power supply and dealing with it not coming on by itself (because it's not connected to a motherboard). So I had to make my own: I took a 40A 5V power supply I had laying around for LEDs, joined it with a 12V 7A power supply, and made my own Sata power bus.
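As a rough check on that power budget, here is a small Python sketch. The supply ratings and the drive count are from the text; the 1.5A peak draw per SSD is a placeholder assumption, not a measured value.

```python
# Rough 5V power budget for the home-made sata power bus.
# Supply ratings and drive count are from the text; PEAK_AMPS_PER_DRIVE
# is an assumed placeholder, not a measurement.

DRIVES = 18
SUPPLY_5V_AMPS = 40
SUPPLY_12V_AMPS = 7
PEAK_AMPS_PER_DRIVE = 1.5   # assumption: worst-case 5V draw of one SSD


def budget_per_drive(supply_amps: float, drives: int) -> float:
    """Amps available to each drive if the load is spread evenly."""
    return supply_amps / drives


per_drive = budget_per_drive(SUPPLY_5V_AMPS, DRIVES)
worst_case = DRIVES * PEAK_AMPS_PER_DRIVE

print(f"5V rail: {SUPPLY_5V_AMPS * 5}W total, {per_drive:.2f}A per drive")
print(f"assumed worst case: {worst_case:.0f}A of {SUPPLY_5V_AMPS}A")
print(f"12V rail: {SUPPLY_12V_AMPS * 12}W total")
```

With these inputs each drive gets a bit over 2A of headroom on the 5V rail, so the supply has margin as long as the assumed per-drive peak is in the right ballpark.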


    From there, I could indeed have 18 drives hang off the sata power plugs ;)


    Or I could do something a bit better, and I found these nice enclosures. Unfortunately they cost $90 each even though they don't provide their own power, and sadly the built in fans require 12V, so I have to send them dual power just for that; otherwise I'd be able to power the entire thing from 5V:



    Making all this work on an rPi5

    So you're going to tell me that maybe an rPi5 wasn't really meant to have a PCI bus, never mind to run 20 SSDs (18 Sata + 2 M2 NVME), and maybe you'd be right, but I got excited when I got this quad NVME expander board for my Pi5:


    I mean it does look pretty and exciting ;)


    blinkenlights win ;)

    But what I didn't pay enough attention to is that it's still a single lane PCI bus (after all, the Pi5 is not exactly a real server board), so what that PCI splitter board does is use a PCI switching chip to create 4 lanes out of 1 by switching PCI packets. This does not create extra bandwidth, it just puts more drives on the same single lane bus. I got things to work, but unsurprisingly a 10 drive raid6 rebuild was slow: only 50MB/s, which is slower than a single drive. Sata does support 6Gbit/s (and SSDs deliver around 600MB/s per drive), but all 18 drives together add up to 10.8GB/s of combined bandwidth, or 108Gbit/s of sata line rate, about 15 times what my single lane PCI bus can do :)
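To put rough numbers on that mismatch, here is a back-of-the-envelope sketch in Python. The sata figures come from the text; the per-lane PCI throughputs (about 500MB/s for a Gen 2 lane, about 985MB/s for a Gen 3 lane) are assumptions about the link, not measurements from this setup.

```python
# Rough bandwidth budget for 18 sata SSDs behind one PCI lane.
# Sata numbers are from the text; the lane throughputs are assumptions
# (approximate usable rates for a single Gen 2 or Gen 3 lane).

SATA_DRIVES = 18
SATA_MB_PER_S = 600      # usable throughput per sata SSD
SATA_LINE_GBIT = 6       # sata line rate per drive

# Assumed usable throughput of one PCI lane, in MB/s.
PCIE_LANE_MB_PER_S = {"gen2": 500, "gen3": 985}


def oversubscription(drives: int, per_drive_mb: int, lane_mb: int) -> float:
    """How many times the drives' combined throughput exceeds the lane."""
    return drives * per_drive_mb / lane_mb


combined_mb = SATA_DRIVES * SATA_MB_PER_S        # MB/s across all drives
combined_gbit = SATA_DRIVES * SATA_LINE_GBIT     # Gbit/s of line rate

print(f"combined sata throughput: {combined_mb / 1000:.1f} GB/s "
      f"({combined_gbit} Gbit/s line rate)")
for gen, lane_mb in PCIE_LANE_MB_PER_S.items():
    ratio = oversubscription(SATA_DRIVES, SATA_MB_PER_S, lane_mb)
    print(f"{gen}: lane ~{lane_mb} MB/s, oversubscribed {ratio:.1f}x")
```

Whichever lane speed applies, the drives can collectively ask for more than ten times what the lane can carry, which is the order of magnitude that matters here.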

    So yes, it can work, but it's not fast. For reference, with an unrestricted sata bus, rebuild speed can be up to 600MB/s, which is the speed limit of the drive writing parity. In real life, the more drives you have, the more data is on the sata bus or busses during rebuild, as explained above, so it's rare to get the full speed, especially with 10 drives. Still, it was sad to get below 100MB/s, as I was getting more than that with my spinning rust drives.

    This is what it looks like, by the way:

    md1 : active raid6 sdr1[7] sdj1[3] sdp1[9] sdq1[6] sdm1[4] sdo1[8] sdi1[2] sdk1[1] sdn1[5] sdl1[0]
          31255076864 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
          bitmap: 0/8 pages [0KB], 65536KB chunk
    

    md2 : active raid5 sdd1[2] sde1[3] sda1[1] sdc1[4] sdf1[5] sdb1[0] sdg1[6] sdh1[8]
          26254235392 blocks super 1.2 level 5, 256k chunk, algorithm 2 [8/8] [UUUUUUUU]
          bitmap: 0/7 pages [0KB], 65536KB chunk
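As a sanity check on those numbers, the block counts from mdstat (reported in 1KiB units) can be turned back into usable and per-member capacity. A small Python sketch, assuming the usual rule that raid5 keeps n-1 drives of data and raid6 keeps n-2; the function name is made up for illustration:

```python
# Turn the mdstat block counts above into capacities.
# mdstat reports array size in 1 KiB blocks; raid5 uses n-1 data drives,
# raid6 uses n-2. The block counts are copied from the mdstat output.

PARITY_DRIVES = {5: 1, 6: 2}


def array_summary(level: int, drives: int, blocks_kib: int) -> dict:
    """Return usable size and implied per-member size for an md array."""
    data_drives = drives - PARITY_DRIVES[level]
    usable_bytes = blocks_kib * 1024
    return {
        "data_drives": data_drives,
        "usable_tb": usable_bytes / 1e12,      # decimal TB, as drives are sold
        "usable_tib": usable_bytes / 2**40,    # binary TiB, as df -h counts
        "per_member_tb": usable_bytes / data_drives / 1e12,
    }


md1 = array_summary(level=6, drives=10, blocks_kib=31255076864)
md2 = array_summary(level=5, drives=8, blocks_kib=26254235392)

for name, info in (("md1", md1), ("md2", md2)):
    print(f"{name}: {info['data_drives']} data drives, "
          f"{info['usable_tb']:.1f} TB = {info['usable_tib']:.1f} TiB usable, "
          f"{info['per_member_tb']:.2f} TB per member partition")
```

The TiB figures land just under the 30T and 25T shown in the df output earlier, and the per-member sizes show that the raid6 members are 4.00TB partitions while the raid5 members are 3.84TB ones.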

    So given that, my master plan of building a big NAS on this does not make a lot of sense, and the quad splitter does not make a lot of sense for anything other than 4 NVME drives that you are ok with running at a much lower speed than they are capable of (a 4 lane NVME drive would run at 1/16th of its speed when getting 1/4th of one lane). In the end, for a couple of drives, using the dual splitter GeekPi board to power 2 independent boards is not such a bad idea, and using the real sata Hat offers real power to the drives (up to 6-7A I think), saving you the trouble of having to make your own power supply like I did:


    Routing the Pi5 ribbon is a bit tricky and requires longer ribbon cables to reach the middle splitter board


    SSDs and Prices, using cheaper DRAM-less SSDs and QLC RDAT drives with Raid5/6

    Obviously I picked the wrong time to buy a bunch of SSDs. Proper 4TB SSDs run around $700, if not worse, so I went for low grade DRAM-less TLC or QLC drives off Ebay (still around $300 a piece). I figured that with RAID6 it would not be so bad, and for one of my 2 arrays, write performance and many rewrites were not a concern. I also found out that the TeamGroup 4TB drives were a mix of TLC and QLC, with 3 different kinds of controllers, and some were 4.09TB where others were just 4.00TB. Then I found out about discard/TRIM support and this:

    /dev/sdr  *  Deterministic read ZEROs after TRIM
    /dev/sdi  *  Deterministic read data after TRIM

    The better, expensive drives guarantee RZAT, and the cheap ones are RDAT. The RDAT drives cannot support TRIM through raid5 or raid6, because raid requires that drives return 0 after TRIM so that parity works out later, and RDAT drives do not give that guarantee. Linux raid nicely detects that and turns off discard support. This however also means that after deleting data, there is no way to mark that flash as free for the drives: you can't trim or fstrim. The only sad thing with btrfs is that, being copy-on-write, it keeps spreading writes across the underlying drives, which means over time all the SSD blocks get used, and there is no way to tell the drives what blocks are free. That is not ideal, especially for QLC drives, as they are quite slow to rewrite blocks when they don't have plenty of free space.
    Knowing that, I made sure to build that array as mostly write once, which makes the write penalty less important.
    For my other array, used for backups and lots of rewriting, I made sure to use higher grade DRAM TLC Samsung and Micron enterprise drives I had laying around. I still had one drive in that array that didn't support RZAT, but with those higher grade drives, not having TRIM was not as bad (they do a better job rewriting and do ok enough with their reserved space).
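The RZAT/RDAT distinction can be checked per drive before building an array. A minimal Python sketch that classifies a drive from the text of its `hdparm -I` output; the function names are made up, and the rule that only RZAT drives keep discard under raid5/6 is taken from the description above rather than from the kernel source:

```python
def trim_behaviour(hdparm_text: str) -> str:
    """Classify a drive's TRIM guarantee from `hdparm -I` style text."""
    lowered = hdparm_text.lower()
    # Check the stronger guarantee first, in case a drive lists both lines.
    if "deterministic read zeros after trim" in lowered:
        return "RZAT"
    if "deterministic read data after trim" in lowered:
        return "RDAT"
    if "trim supported" in lowered:
        return "TRIM without deterministic reads"
    return "no TRIM reported"


def keeps_discard_under_parity_raid(hdparm_text: str) -> bool:
    """Per the text above: only RZAT drives keep discard under raid5/6."""
    return trim_behaviour(hdparm_text) == "RZAT"


# The two lines quoted earlier, one per drive.
samples = {
    "/dev/sdr": "   *  Deterministic read ZEROs after TRIM",
    "/dev/sdi": "   *  Deterministic read data after TRIM",
}
for dev, text in samples.items():
    print(dev, trim_behaviour(text), keeps_discard_under_parity_raid(text))
```

In practice the text would come from running `hdparm -I` on each member drive; a single RDAT member is enough to lose discard for the whole array, as happened with the second array here.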

    Stressing the rPi5 and the ASM1184e PCI switch

    I then learned about a bunch of the limitations of PCI port switches like the ASM1184e. Once I started using mine seriously, I got a bunch of weird errors and disconnects, until Gemini found that overheating under load is a known issue with them. I just put an RC plane video chip radiator on the chip, and now the radiator is hot and the chip seems to work reliably.


    Then I found out that my cheap teamgroup 4TB DRAM-less drives (the real TLC DRAM ones are now hovering between $600 and $700 a pop for 4TB) are fine, until they stall during a big copy, btrfs scrub, or whatever.
    When they stall, they eventually time out the PCI bus, which, behind the quad PCI switch, causes the rPi to reset everything. In the end this caused enough PCI mayhem that the sata cards were reset, and 3 of the teamgroup drives crashed and failed to write what they had, to the point that they were corrupted enough for linux to not be able to use their partitions anymore. Yes, a single drive stall caused a PCI timeout long enough to crash/reset the SATA controllers, which apparently managed to get the cheap teamgroup drives to corrupt their partition table blocks and leave those blocks unmapped, unreadable and unwritable:

    nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [57247.067230] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [57247.076133] ata8: SError: { RecovData Handshk }
    [57247.081246] ata8.00: failed command: READ DMA
    [57247.086014] ata8.00: cmd c8/00:08:c8:03:5b/00:00:00:00:00/e1 tag 2 dma 4096 in
    [57247.086014] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    [57247.101469] ata8.00: status: { DRDY }
    [57247.105822] ata8: hard resetting link
    [57247.153797] nvme nvme0: 3/0/0 default/read/poll queues
    [57247.587051] ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [57247.630423] ata8.00: supports DRM functions and may not be fully accessible
    [57247.750869] ata8.00: supports DRM functions and may not be fully accessible
    [57247.807025] ata8.00: configured for UDMA/133
    [57247.811957] sd 7:0:0:0: [sdh] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=32s
    [57247.822653] sd 7:0:0:0: [sdh] tag#2 Sense Key : 0xb [current]
    [57247.829121] sd 7:0:0:0: [sdh] tag#2 ASC=0x0 ASCQ=0x0
    [57247.835477] sd 7:0:0:0: [sdh] tag#2 CDB: opcode=0x88 88 00 00 00 00 00 01 5b 03 c8 00 00 00 08 00 00
    [57247.845243] I/O error, dev sdh, sector 22741960 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
    [57247.855511] ata8: EH complete
    [57247.872535] ata8.00: Enabling discard_zeroes_data
    [60367.285453] ata9.00: exception Emask 0x10 SAct 0x0 SErr 0x400100 action 0x6 frozen
    [60367.293666] ata9.00: irq_stat 0x08000000, interface fatal error
    [60367.300313] ata9: SError: { UnrecovData Handshk }
    [60367.306530] ata9.00: failed command: WRITE DMA EXT
    [60367.311966] ata9.00: cmd 35/00:00:78:8c:f7/00:05:1e:00:00/e0 tag 9 dma 655360 out
    [60367.311966] res 50/00:00:ff:03:f7/00:00:1e:00:00/e0 Emask 0x10 (ATA bus error)
    [60367.328871] ata9.00: status: { DRDY }
    [60367.333036] ata9: hard resetting link
    [60367.805496] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [60367.863205] ata9.00: configured for UDMA/133
    [60367.868064] ata9: EH complete
    [60397.357520] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [60397.453616] nvme nvme0: 3/0/0 default/read/poll queues
    [60398.929509] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [60398.959616] ata1: SError: { RecovData Handshk }
    [60398.966761] ata1.00: failed command: READ DMA
    [60398.972859] ata1.00: cmd c8/00:08:78:b9:4a/00:00:00:00:00/e2 tag 22 dma 4096 in
    [60398.972859] res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [60398.990825] ata1.00: status: { DRDY }
    [60398.995717] ata1: hard resetting link
    [60399.473455] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [60399.541525] ata1.00: configured for UDMA/133
    [60399.546657] sd 0:0:0:0: [sda] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=32s
    [60399.557577] sd 0:0:0:0: [sda] tag#22 Sense Key : 0xb [current]
    [60399.564532] sd 0:0:0:0: [sda] tag#22 ASC=0x0 ASCQ=0x0
    [60399.570665] sd 0:0:0:0: [sda] tag#22 CDB: opcode=0x88 88 00 00 00 00 00 02 4a b9 78 00 00 00 08 00 00
    [60399.580758] I/O error, dev sda, sector 38451576 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
    [60399.590585] ata1: EH complete
    [60399.640204] ata1.00: Enabling discard_zeroes_data
    [72688.943036] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [72688.951084] ata1: SError: { RecovData Handshk }
    [72688.956422] ata1.00: failed command: WRITE DMA
    [72688.961594] ata1.00: cmd ca/00:20:00:ac:82/00:00:00:00:00/e5 tag 14 dma 16384 out
    [72688.961594] res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [72688.977731] ata1.00: status: { DRDY }
    [72688.981969] ata1: hard resetting link
    [72688.986211] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [72688.994663] ata2: SError: { RecovData Handshk }
    [72688.999881] ata2.00: failed command: WRITE DMA
    [72689.005000] ata2.00: cmd ca/00:20:c0:b0:82/00:00:00:00:00/e5 tag 19 dma 16384 out
    [72689.005000] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [72689.022962] ata2.00: status: { DRDY }
    [72689.027430] ata2: hard resetting link
    [72689.499039] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [72689.506396] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [72689.611777] ata1.00: configured for UDMA/133
    [72689.616890] ata1: EH complete
    [72689.723181] ata2.00: configured for UDMA/133
    [72689.728156] ata2: EH complete
    [72689.865333] ata1.00: Enabling discard_zeroes_data
    [72689.871277] ata2.00: Enabling discard_zeroes_data
    [73227.538624] nvme nvme1: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [73227.640436] nvme nvme1: D3 entry latency set to 8 seconds
    [73227.658550] nvme nvme1: 1/0/0 default/read/poll queues
    [86766.334170] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [86766.442187] nvme nvme0: 3/0/0 default/read/poll queues
    [86766.863105] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [86766.877356] ata6: SError: { RecovData Handshk }
    [86766.884232] ata6.00: failed command: WRITE DMA
    [86766.891103] ata6.00: cmd ca/00:80:18:95:b5/00:00:00:00:00/e6 tag 20 dma 65536 out
    [86766.891103] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    [86766.908556] ata6.00: status: { DRDY }
    [86766.914377] ata6: hard resetting link
    [86766.919016] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [86766.930307] ata2: SError: { RecovData Handshk }
    [86766.937937] ata2.00: failed command: READ DMA
    [86766.943738] ata2.00: cmd c8/00:38:a0:e6:3b/00:00:00:00:00/e5 tag 4 dma 28672 in
    [86766.943738] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [86766.965459] ata2.00: status: { DRDY }
    [86766.970640] ata2: hard resetting link
    [86766.976782] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x400001 action 0x6 frozen
    [86766.989369] ata3: SError: { RecovData Handshk }
    [86767.001777] ata3.00: failed command: WRITE DMA
    [86767.010295] ata3.00: cmd ca/00:80:18:95:b5/00:00:00:00:00/e6 tag 21 dma 65536 out
    [86767.010295] res 40/00:00:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [86767.060215] ata3.00: status: { DRDY }
    [86767.071409] ata3: hard resetting link
    [86767.550253] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [86767.563271] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [86767.572715] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [86767.585598] ata6.00: supports DRM functions and may not be fully accessible
    [86767.616959] ata6.00: supports DRM functions and may not be fully accessible
    [86767.631980] ata3.00: configured for UDMA/133
    [86767.639404] ata3: EH complete
    [86767.643354] ata6.00: configured for UDMA/133
    [86767.661059] ahci 0001:03:00.0: port does not support device sleep
    [86767.663591] ata3.00: Enabling discard_zeroes_data
    [86767.676336] ata6: EH complete
    [86767.745871] ata2.00: configured for UDMA/133
    [86767.754280] ata2: EH complete
    [86767.772933] ata2.00: Enabling discard_zeroes_data
    [95256.566913] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [95256.574928] nvme nvme1: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
    [95256.679475] nvme nvme1: D3 entry latency set to 8 seconds
    [95256.689110] nvme nvme0: 2/0/0 default/read/poll queues
    [95256.694718] nvme nvme1: 1/0/0 default/read/poll queues
    [95256.697626] I/O error, dev nvme0n1, sector 264208 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
    [95256.712397] I/O error, dev nvme0n1, sector 264208 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
    [95256.722258] md: super_written gets error=-5
    [95256.727133] md/raid1:md0: Disk failure on nvme0n1p2, disabling device.
    [95256.727133] md/raid1:md0: Operation continuing on 1 devices.
    [95256.742401] I/O error, dev nvme0n1, sector 77334752 op 0x1:(WRITE) flags 0x4000800 phys_seg 1 prio class 2
    [95256.753375] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
    [95256.764177] I/O error, dev nvme0n1, sector 77335776 op 0x1:(WRITE) flags 0x4000800 phys_seg 1 prio class 2
    [95256.774805] BTRFS error (device nvme0n1p3): bdev /dev/nvme0n1p3 errs: wr 2, rd 1, flush 0, corrupt 0, gen 0
    [97602.825969] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    [97602.833948] ata6.00: failed command: WRITE DMA EXT
    [97602.839911] ata6.00: cmd 35/00:00:78:5a:4c/00:04:09:00:00/e0 tag 22 dma 524288 out
    [97602.839911] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    [97602.858583] ata6.00: status: { DRDY }
    [97602.863617] ata6: hard resetting link
    [97603.337938] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    [97603.346750] ata6.00: supports DRM functions and may not be fully accessible
    [97603.370306] ata6.00: supports DRM functions and may not be fully accessible
    [97603.430476] ata6.00: configured for UDMA/133
    [97603.445466] ahci 0001:03:00.0: port does not support device sleep
    [97603.452251] ata6: EH complete
    [97637.643844] BTRFS warning (device dm-1): csum failed root 263 ino 3692950 off 386400256 csum 0xd04e5f48 expected csum 0x6b9afaa1 mirror 1
    [97637.657936] BTRFS error (device dm-1): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
    [97638.110104] BTRFS warning (device dm-1): csum failed root 263 ino 3692950 off 386400256 csum 0xd04e5f48 expected csum 0x6b9afaa1 mirror 1
    [97638.123856] BTRFS error (device dm-1): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
    [97662.159091] BTRFS warning (device dm-1): csum failed root 263 ino 3692950 off 386400256 csum 0xd04e5f48 expected csum 0x6b9afaa1 mirror 1
    [97662.173941] BTRFS error (device dm-1): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
    [97662.906008] BTRFS warning (device dm-1): csum failed root 263 ino 3692950 off 386400256 csum 0xd04e5f48 expected csum 0x6b9afaa1 mirror 1
    [97662.920993] BTRFS error (device dm-1): bdev /dev/mapper/dshelf2 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
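A log like this is easier to reason about once it is summarised per device. A minimal Python sketch that counts sata link resets per port and controller resets per NVMe device; the two regexes only assume the message shapes visible in the log above, and the sample is a handful of lines copied from it:

```python
import re
from collections import Counter

# Patterns for the two reset messages that appear in the kernel log above.
SATA_RESET = re.compile(r"\b(ata\d+): hard resetting link")
NVME_RESET = re.compile(r"\bnvme (nvme\d+): controller is down; will reset")


def count_resets(log_text: str):
    """Return (sata_resets, nvme_resets) as Counters keyed by device name."""
    sata = Counter(m.group(1) for m in SATA_RESET.finditer(log_text))
    nvme = Counter(m.group(1) for m in NVME_RESET.finditer(log_text))
    return sata, nvme


SAMPLE = """\
[72688.981969] ata1: hard resetting link
[72689.027430] ata2: hard resetting link
[73227.538624] nvme nvme1: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
[86766.334170] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
[86766.914377] ata6: hard resetting link
[86766.970640] ata2: hard resetting link
"""

sata, nvme = count_resets(SAMPLE)
print("sata link resets:", dict(sata))
print("nvme controller resets:", dict(nvme))
```

Pointed at saved `dmesg` output, the same function shows at a glance whether resets cluster on one port (a bad drive or cable) or hit several ports and both NVMe devices at once, which is what points at the shared PCI switch.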

    Recovering unusable DRAM-less Teamgroup drives

    By then it was impossible to read or write to the 3 Teamgroup drives that failed, so I had to blkdiscard (TRIM) all 3 crashed drives (out of 7) in their entirety to restart with 0's everywhere (which meant full data loss of course), and start over.
    Gemini gave me linux kernel sata and PCI options to make it less likely for this to happen again, but it also warned me that it very much can happen again, and that DRAM-less drives should never be behind a PCI switch.

    At the same time, it became painfully obvious that the rPi5 has single lane PCI, and that using all those PCI switches, which add more channels while dividing the single lane bandwidth and make things slower and slower, was a bit of a fool's errand.
    By then, I had to admit defeat, and since I wanted to run frigate for my cameras anyway, Gemini suggested I get an N355 based server, which has an H264 and H265 ASIC for all those video streams (while the rPi5 would have to do it in software), and at least 4 PCI lanes, which is much better (it's still 4x single lane M2 NVME, but at least 4 times faster and without a PCI switch to confuse things and cause full hangs if a single sata drive freezes while writing its data).

    π 2025-11-29 01:01 in Computers, Sciencemuseums
    I randomly happened to be at the CHM to drop off some donations, noticed a room I didn't remember, and found out they now had a fully working IBM 1402 (actually they had two).


















    Video demo:

    π 2025-11-07 01:01 in Computers, Linux, Public
    My 2 main computers have been called magic and moremagic since the late 90's. Most people do not know why, so here is the story I read back in the 90's, reposted from http://www.catb.org/jargon/html/magic-story.html

    A Story About 'Magic'

    Some years ago, I (GLS) was snooping around in the cabinets that housed the MIT AI Lab's PDP-10, and noticed a little switch glued to the frame of one cabinet. It was obviously a homebrew job, added by one of the lab's hardware hackers (no one knows who).

    You don't touch an unknown switch on a computer without knowing what it does, because you might crash the computer. The switch was labeled in a most unhelpful way. It had two positions, and scrawled in pencil on the metal switch body were the words 'magic' and 'more magic'. The switch was in the 'more magic' position.

    I called another hacker over to look at it. He had never seen the switch before either. Closer examination revealed that the switch had only one wire running to it! The other end of the wire did disappear into the maze of wires inside the computer, but it's a basic fact of electricity that a switch can't do anything unless there are two wires connected to it. This switch had a wire connected on one side and no wire on its other side.

    It was clear that this switch was someone's idea of a silly joke. Convinced by our reasoning that the switch was inoperative, we flipped it. The computer instantly crashed.

    Imagine our utter astonishment. We wrote it off as coincidence, but nevertheless restored the switch to the 'more magic' position before reviving the computer.

    A year later, I told this story to yet another hacker, David Moon as I recall. He clearly doubted my sanity, or suspected me of a supernatural belief in the power of this switch, or perhaps thought I was fooling him with a bogus saga. To prove it to him, I showed him the very switch, still glued to the cabinet frame with only one wire connected to it, still in the 'more magic' position. We scrutinized the switch and its lone connection, and found that the other end of the wire, though connected to the computer wiring, was connected to a ground pin. That clearly made the switch doubly useless: not only was it electrically nonoperative, but it was connected to a place that couldn't affect anything anyway. So we flipped the switch.

    The computer promptly crashed.

    This time we ran for Richard Greenblatt, a long-time MIT hacker, who was close at hand. He had never noticed the switch before, either. He inspected it, concluded it was useless, got some diagonal cutters and diked it out. We then revived the computer and it has run fine ever since.

    We still don't know how the switch crashed the machine. There is a theory that some circuit near the ground pin was marginal, and flipping the switch changed the electrical capacitance enough to upset the circuit as millionth-of-a-second pulses went through it. But we'll never know for sure; all we can really say is that the switch was magic.

    I still have that switch in my basement. Maybe I'm silly, but I usually keep it set on 'more magic'.

    1994: Another explanation of this story has since been offered. Note that the switch body was metal. Suppose that the non-connected side of the switch was connected to the switch body (usually the body is connected to a separate earth lug, but there are exceptions). The body is connected to the computer case, which is, presumably, grounded. Now the circuit ground within the machine isn't necessarily at the same potential as the case ground, so flipping the switch connected the circuit ground to the case ground, causing a voltage drop/jump which reset the machine. This was probably discovered by someone who found out the hard way that there was a potential difference between the two, and who then wired in the switch as a joke.

    π 2025-10-27 01:01 in Computers, Linux
    This is part #2 of "Finishing Upgrade of Year 2000 Linux System From i386 to amd64 to arm64 for Raspberry Pi5 with mailman 2.1.7 for Python 2 (the last 5% that took 70% of the time)", as an upgrade for "Magic v5: From Dell Poweredge 2950 to Raspberry Pi 5 (skipping Dell DSS1510)".

    After upgrading my main server from amd64 to arm64 (rPi), I was forced to re-install all of linux, for the first time in 25+ years for that server, which included upgrading every single linux package I had to Debian/Trixie (13). Those upgrades are always "interesting" when you have a lot of history and state, but it turns out it went pretty well, except for exim4.

    As much as I'm thankful for exim4 and its developers, and all the work they do, I respectfully think the way they implemented tainting on $local_part, the name of the recipient, was poor and showed no regard for the cost to countless admins whose configs got broken. Namely:

  • Debian literally had to write allow_insecure_tainted to avoid breaking their users overnight. They knew how bad the upgrade and breakage were going to be (sadly it was removed later and exim4 didn't use the hint to lessen the pain of upgrades)
  • Exim never provided a clear guide on the most common ways to fix this, including clear fixes for common configurations, using mailman with exim being one of them. Exim has excellent documentation that is very extensive, but takes days to read and understand (it took me over a week my first time, 25 years ago). Expecting users to dig back into such a complex system many years later and figure out very non trivial config steps is not fair in my book.
  • Why is there no detailed message in exim panic_log to tell the admin what happened and what to do, along with a bounce message saying the answer is in the local exim logs?
  • Exim should add an untaint() with a fixed safe regex that will work for most everyone
  • The local_part_data is deep black magic and not a reasonable sole solution (it's empty and unusable by default). There should be a local_part_safe that is automatically populated via a safe regex
  • The debian answer of "turn off tainting" should honestly be a real option. Forcing admins to be broken if they have certified they are safe, or are in an environment where it's really fine, is NOT an appropriate answer and honestly unfair to admins who deal with lots of things and cannot be experts on the deep internals of dozens or hundreds of daemons. Yes, that means allowing an admin who may already have been running an unsafe setup for 20 years to potentially continue to do so if they deem it's actually ok/safe in their setup. The admin must be trusted and not treated like a clueless person who must be blocked from running the software (breaking delivery to mailman is blocking me from using exim altogether).
  • For people who disagree with that last point, please understand that this option is still there no matter what. If admins cannot untaint a safe config, they will downgrade exim, and it looks like I did exactly that in the past. This is literally the worst case scenario users are forced into if they can't figure out a very non trivial solution with very few clues
  • Exim posts:

  • https://lists.exim.org/lurker/message/20251027.164803.8ab41844.en.html
  • https://lists.exim.org/lurker/message/20251027.162524.1f7d6cf1.en.html
  • https://lists.exim.org/lurker/message/20251027.181509.83258145.en.html
    So here is what I figured out in the end, after way too many hours (probably more than 10h at this point, which is totally not cool: upgrades should not cause downtimes of 10h plus that amount of lost admin time in debugging, research, and fixing). Exim seems to have totally over-reacted to the local_part untrusted data problem, giving the admin literally no way to clean up the variable on their own with a safe regex, maybe provided by exim itself, and seems to force the admin to compare local_part against trusted data on the server only, or it will simply remain tainted and unusable. This is way over the top, especially when you can run a command in a pipe without suffering from shell quoting issues.

    The solution I found after help from others, is:

    mm21_director:
      debug_print = "R: mm21_director for $local_part@$domain"
      driver = accept
      # black magic to populate local_part_data, the untainted version of local_part
      local_parts = dsearch,filter=dir;MAILMAN_HOME/lists
      require_files = MAILMAN_HOME/lists/${lc::$local_part_data}/config.pck
      local_part_suffix = "-bounces:-bounces+*:-confirm+*:-join:-leave:-owner:-request:-admin"
      transport = mm21_transport
    .endif
    

    mm21_transport:
      debug_print = "T: mm21_transport for $local_part@$domain"
      driver = pipe
      # In case you wonder, substr_2 removes the leading '-'
      # and the regex removes optional +foo=hostname that can be after -bounce
      # (if you use VERP) -- Marc
      command = MAILMAN_WRAP "${if def:local_part_suffix{${substr_2:{${sg{${lc:$local_part_suffix}}{\\\\\+.*}{}}}}{post}}" ${lc:$local_part_data}
      current_directory = MAILMAN_HOME
      home_directory = MAILMAN_HOME
      user = MAILMAN_UID
      group = MAILMAN_GID
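Going by the comment in that transport, the expansion turns the matched suffix into the command name handed to the mailman wrapper. A rough Python equivalent of that described behaviour (an illustration of what the comment says, not a verified translation of the Exim expansion; the function name is made up):

```python
import re
from typing import Optional


def mailman_command(local_part_suffix: Optional[str]) -> str:
    """Map an address suffix such as '-bounces+user=host' to a wrapper command."""
    if not local_part_suffix:
        return "post"                       # no suffix: a normal list post
    suffix = local_part_suffix.lower()
    suffix = re.sub(r"\+.*", "", suffix)    # drop the optional VERP '+foo=hostname'
    if suffix.startswith("-"):
        suffix = suffix[1:]                 # drop the leading '-'
    return suffix


for example in (None, "-request", "-bounces+alice=example.org", "-Confirm+abc123"):
    print(example, "->", mailman_command(example))
```

So mail to the bare list address is treated as a post, while an address carrying one of the suffixes listed in the router ends up as the matching wrapper command.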

    What I had to fix is add "local_parts = dsearch,filter=dir;MAILMAN_HOME/lists" which was 100% required for local_part_data to be populated. Without that, local_part_data is and remains NULL.
    It's disappointing how non trivial and over complicated this is, and most importantly how there was no "MUST READ THIS TAINTED UPGRADE" document with proper detailed info around this in one place (not scattered around a very big manual), along with the most common solutions to the very extreme new tainted restrictions.
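Conceptually, the dsearch lookup swaps the sender-supplied local part for a name read from the server's own filesystem, which is why the result can be trusted. A rough Python analogy of that idea (not Exim code; the function name and the temporary directory layout are made up for illustration):

```python
import os
import tempfile
from typing import Optional


def untainted_list_name(local_part: str, lists_dir: str) -> Optional[str]:
    """Return the matching directory name as read from disk, or None."""
    with os.scandir(lists_dir) as entries:
        for entry in entries:
            if entry.is_dir() and entry.name == local_part:
                # The returned string comes from the directory listing,
                # not from the message, which is the point of the lookup.
                return entry.name
    return None


with tempfile.TemporaryDirectory() as lists_dir:
    os.mkdir(os.path.join(lists_dir, "mylist"))
    print(untainted_list_name("mylist", lists_dir))      # a list that exists
    print(untainted_list_name("../../etc", lists_dir))   # not a list: no match
```

A local part that does not name an existing list directory yields nothing at all, so a hostile value never reaches the file path or the pipe command.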

    Useful links I saved along the way:

  • https://postmaster.google.com/u/5/dashboards#do=merlins.org&st=inboundDeliveryErrorRate&dr=7
  • https://mxtoolbox.com/SuperTool.aspx?action=dkim%3amerlins.org%3a20251023&run=toolpage
  • https://www.exim.org/exim-html-current/doc/html/spec_html/ch-dkim_spf_srs_and_dmarc.html
    π 2025-10-26 01:01 in Computers, Linux
    Part #2 was unfortunately much more painful in an unnecessary way due to a poorly made forced API change in exim4

    It's been a while since I've been in XKCD 349 land :) Actually it's a good thing because honestly, it's really not fun and I enjoy other hobbies in my life, too :)


    The power of linux is that I never really had to re-install the linux system I built in 2000 or so, because Debian is just that good. I did do an upgrade from i386 to amd64, but that was possible thanks to biarch in debian and a fancy and impressive in-place binary upgrade from ia32/i386 to amd64.

    Now, because of this little problem where my amd64 capable server from 2019 was taking way too much power (400W or so), I decided to replace it with an rPi5 which is almost 3 times faster for 20 times less power.


    Despite the different binary arch, migrating was not a huge deal, although I still had ancient stuff running python2 that took a while to upgrade. I figured it was time to get rid of python2, which has been gone from debian for a while (I went to trixie, v13, and python2 was removed after bullseye, v11, two releases earlier).
    I was almost done with my upgrade and everything being back up, and then came the subject of mailman. Oh, no, mailman!
    I used to be a mailman expert in 1999-2000 (yes, really, haha) and knew the code well, but it's been 25 years: I've kept using it to run a few lists, but otherwise haven't touched it in all that time.

    Of course, by now there is mailman3 that uses python3, but installing that on debian installed dozens of python packages, a new database system and god knows what I just didn't want or didn't need. Worse, I remembered that I have a fancy exim4 config that detects the mailman .pck files and auto provisions lists and aliases. Also, I changed the web interface a bit.

    As yucky as it is, I'm already 3 days into this full server upgrade and don't want to spend a day or more learning this new mailman3 and migrating to it. It's simply not worth my time, and I'm happy to keep my few lists running as is.

    So here is what I had to do:

    Installing python2 was not too hard, I just had to bring back an old installation for bullseye:
    

    magic:/usr/bin# cat /etc/apt/sources.list.d/debian_bullseye_python2.sources
    Types: deb
    URIs: http://deb.debian.org/debian
    Suites: bullseye
    Components: main contrib non-free non-free-firmware
    Signed-By: /usr/share/keyrings/debian-archive-keyring.pgp

    apt-get install python2.7-minimal
    magic:/usr/bin# ln -s python2.7 python2

    Amazingly the packages were built well enough that they installed without fuss on trixie, including some dependencies:

    moremagic:/etc/apt# apt-get install python2.7-minimal
    Reading package lists... Done
    Building dependency tree... Done
    Reading state information... Done
    The following additional packages will be installed:
      libpython2.7-minimal
    Suggested packages:
      binfmt-support
    Recommended packages:
      libpython2.7-stdlib python2.7
    The following NEW packages will be installed:
      libpython2.7-minimal python2.7-minimal
    0 upgraded, 2 newly installed, 0 to remove and 45 not upgraded.
    Need to get 1,593 kB of archives.
    After this operation, 6,393 kB of additional disk space will be used.
    Do you want to continue? [Y/n] y
    moremagic:/etc/apt#

    Now, mailman2 is python, so we're good, right? Well, not quite. There were some cgi binaries that hardcoded stuff for safety, and were obviously i386 on my system (~mailman/mail/mailman and ~mailman/cgi-bin/*).
    I did have server backups going back to 2002 (not bad, haha, and yes they really still work), so I found the source I used back then. But then I realized that trying to rebuild the whole thing might take a while since it's all ancient configure, ancient python, and so forth. Just yesterday I had to rebuild some ancient C code, and its bundled configure crashed because its "is gcc there" test no longer compiles: it told me my gcc could not build binaries, when in fact the test program was so old that a modern gcc rejects it. I just removed the test (the rest actually built).

    configure:1004: gcc -o conftest    conftest.c  1>&5
    configure:1001:1: error: return type defaults to 'int' [-Wimplicit-int]
     1001 | main(){return(0);}
          | ^~~~
    configure: failed program was:

    After the source failed to build right away due to missing ancient python stuff, I asked myself "eh, can I maybe just get those i386 binaries to work on arm64 as is?". And the answer is yes:

    magic:/var/local/mailman/mail# ./mailman 
    bash: ./mailman: cannot execute binary file: Exec format error
    

    # install binary emulator, not fast but more than good enough for my needs:
    magic:/lib# apt-get install qemu-user-static
    The following additional packages will be installed:
      qemu-user qemu-user-binfmt
    The following NEW packages will be installed:
      qemu-user qemu-user-binfmt qemu-user-static
    Do you want to continue? [Y/n] y
    Get:1 http://deb.debian.org/debian trixie/main arm64 qemu-user arm64 1:10.0.3+ds-0+deb13u1 [64.1 MB]
    Get:2 http://deb.debian.org/debian trixie/main arm64 qemu-user-binfmt arm64 1:10.0.3+ds-0+deb13u1 [2,068 B]
    Get:3 http://deb.debian.org/debian trixie/main arm64 qemu-user-static arm64 1:10.0.3+ds-0+deb13u1 [55.1 kB]

    magic:/var/local/mailman/mail# ./mailman
    i386-binfmt-P: Could not open '/lib/ld-linux.so.2': No such file or directory

    # copied over libraries from an old system:
    magic:/lib/i686# l
    -rwxr-xr-x 1 root root  171404 Oct 26 16:38 ld-linux.so.2*
    -rwxr-xr-x 1 root root 1993968 Oct 26 16:39 libc.so.6*

    magic:/lib# ln -s i686/ld-linux.so.2 .
    magic:/var/local/mailman/mail# ./mailman
    Usage: ./mailman program [args...]

    Success!
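
    As a side note, the "Exec format error" comes from the kernel looking at the ELF header and finding a machine type it cannot run natively. Here is a minimal Python sketch that reads that field, assuming the standard ELF layout (e_machine is a 16-bit value at offset 18; 3 is i386, 62 is x86-64, 183 is aarch64). The function names are mine, and the demo uses hand-built headers rather than real binaries.

```python
import struct

# e_machine values from the ELF specification.
EM_NAMES = {3: "i386", 62: "x86-64", 183: "aarch64"}

def elf_machine(header: bytes) -> str:
    """Return the architecture named in an ELF header (the first 20 bytes are enough)."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[5] (EI_DATA): 1 = little endian, 2 = big endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return EM_NAMES.get(machine, "unknown (%d)" % machine)

def fake_header(machine: int) -> bytes:
    """Build a minimal little-endian header for the demo (not a runnable binary)."""
    e_ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
    return e_ident + struct.pack("<HH", 2, machine)

print(elf_machine(fake_header(3)))
print(elf_machine(fake_header(183)))
```

    On a real file you would pass the first 20 bytes, for example elf_machine(open(path, "rb").read(20)).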

    Well, now when I connect, I see:

    The Mailman CGI wrapper encountered a fatal error. This entry is being stored in your syslog:
    Failure to find group name for GID 33.  Mailman
    expected the CGI wrapper to be executed as group
    "www-data", but the system's web server executed the
    wrapper as GID 33 for which the name could not be
    found.  Try adding GID 33 to your system as "www-data",
    or tweak your web server to run the wrapper as group
    "www-data".

    Now, this is actually already good: it means the CGI (i386 code) is running on arm64, but there is indeed a library issue because /etc/group does have "www-data:x:33:". Strace showed it was looking for libnss_files.so.2, which makes sense.
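
    Conceptually, the lookup the wrapper could not perform is just a scan of the group file for a matching GID, which is the job of the "files" NSS backend. A small Python sketch of that idea follows; it is a simplification for illustration, not glibc's implementation, and the helper name and sample lines are made up.

```python
from typing import Iterable, Optional

def group_name_for_gid(group_lines: Iterable[str], gid: int) -> Optional[str]:
    """Find the group name for a GID in /etc/group style lines."""
    for line in group_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Format: name:password:gid:member,member,...
        fields = line.split(":")
        if len(fields) >= 3 and fields[2] == str(gid):
            return fields[0]
    return None

sample = ["root:x:0:", "mail:x:8:", "www-data:x:33:"]
print(group_name_for_gid(sample, 33))
print(group_name_for_gid(sample, 999))
```

    Without the library that does this, the i386 wrapper gets no name back for GID 33 even though the entry is right there in the file.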

    Copied over the lib:
    magic:/lib# l /lib/i686/libnss_files.so.2
    -rw-r--r-- 1 root root 50812 Oct 26 17:45 /lib/i686/libnss_files.so.2
    magic:/var/local/mailman/cgi-bin# su www-data
    magic:/var/local/mailman/cgi-bin$ ./listinfo
      File "/var/local/mailman/scripts/driver", line 107
        print 'Status: 405 Method not allowed'
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?

    Progress! (now the wrapper is running the wrong python). The easy fix is of course to make /usr/bin/python point to python2, but I was trying to resist doing so. However, at this point I decided to stop being a purist: honestly, this python2/python3 stuff has cost me so much time in the past already that I'm fine with python being python2. All python3 code calls /usr/bin/python3 anyway.

    By now, things are looking better and https://lists.merlins.org/lists/listinfo is returning

    Bug in Mailman version 2.1.14
    We're sorry, we hit a bug!
    Please inform the webmaster for this site of this problem. Printing of traceback and other system information has been explicitly inhibited, but the webmaster can find this information in the Mailman error logs.

    From there, I had to debug some non trivial permission issues which I think were due to qemu not respecting the setgid bit when running i386 code.
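
    One possible explanation, offered as an assumption and not something I verified on this system: the kernel's binfmt_misc only takes setuid/setgid credentials from the emulated binary when the handler was registered with the C (credentials) flag, otherwise they are computed from the interpreter. The sketch below parses the text of a handler entry as found under /proc/sys/fs/binfmt_misc/ and reports whether C is set. The sample text, including the interpreter path, is made up.

```python
def binfmt_flags(entry_text: str) -> str:
    """Extract the flag letters from a binfmt_misc handler entry."""
    for line in entry_text.splitlines():
        if line.startswith("flags:"):
            return line.split(":", 1)[1].strip()
    return ""

def honors_setgid(entry_text: str) -> bool:
    """True if the handler carries the C (credentials) flag."""
    return "C" in binfmt_flags(entry_text)

# Hypothetical entry text, for illustration only.
sample = """enabled
interpreter /usr/bin/qemu-i386-static
flags: PF
offset 0
magic 7f454c46
"""
print(binfmt_flags(sample), honors_setgid(sample))
```

    If the real entry has no C flag, the setgid bit on the i386 wrapper would be ignored, which would match the group mismatch errors shown here.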

    magic:~$ /var/local/mailman/mail/mailman post testlist
    Group mismatch error.  Mailman expected the mail
    wrapper script to be executed as group "mail", but
    the system's mail server executed the mail script as
    group "www-data".  Try tweaking the mail server to run the
    script as group "mail", or re-run configure, 
    providing the command line option `--with-mail-gid=www-data'.

    This was all because the CGIs had to be SGID mailman and therefore had to be C binaries, because python suid/sgid was considered not safe at the time. This has been fixed many ways in the last 25 years, but I wanted to keep things as is without getting into new rabbit holes :)
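
    For reference, checking whether a wrapper actually carries the setgid bit is easy to script. This is a minimal sketch using Python's stat module on mode values; the helper names are mine, and on a real file you would feed it os.stat(path).st_mode.

```python
import stat

def mode_has_setgid(mode: int) -> bool:
    """True if the setgid bit is set in a file mode."""
    return bool(mode & stat.S_ISGID)

def describe(mode: int) -> str:
    """Render a mode the way ls -l shows it."""
    return stat.filemode(mode)

# A regular file with mode 2755 (setgid) versus plain 755.
print(describe(0o102755), mode_has_setgid(0o102755))
print(describe(0o100755), mode_has_setgid(0o100755))
```

    The bit being present on disk is only half of the story though: whatever executes the binary also has to honor it.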

    Sadly, it went downhill from there, and avoiding one 2h rabbit hole just made me fall into another one. But it was cool to see I could run intel binaries on rpi5/arm64 when needed.
    It did however break sgid, which is essential for mailman, so the reasonable path turned out to be rebuilding, since I did have the source and even a source tree from 2002 with the right build options still baked in:

    magic:/var/local/src/mailman-2.1.7/src# make clean; make; make install
    (...)
    for f in admindb admin confirm create edithtml listinfo options private rmlist roster subscribe; do     exe=/var/local/mailman/cgi-bin/$f;     /usr/bin/install -c -m 755 $f $exe;     chmod g+s $exe; done
    for f in mailman; do     /usr/bin/install -c -m 755 $f /var/local/mailman/mail;     chmod g+s /var/local/mailman/mail/$f; done

    Yeah, that took less than 5 minutes and made native binaries. With that the web pages worked right away, but the Email gateway script was still being difficult, and exim4 debugging didn't show its output, making it hard to debug. The debug output does not even make clear what the full command line was (you need -d+all to see it, barely) or that the command failed.

    Works from command line:
    magic:~$ id
    uid=8(mail) gid=8(mail) groups=8(mail)
    magic:~$ ~mailman/mail/mailman post testlist
    From: marc@merlins.org
    To: testlist@lists.merlins.org
    subject: test 7

    test

    But when sending through exim:
    >>>>>>>>>>>>>>>> Exim pid=1720374 (delivery-local) terminating with rc=0 >>>>>>>>>>>>>>>>
    mm21_transport transport returned FAIL for testlist@lists.merlins.org
    post-process testlist@lists.merlins.org (2)
    LOG: MAIN
      ** testlist@lists.merlins.org F=<root@merlins.org> R=mm21_main_director T=mm21_transport: Tainted arg 2 for mm21_transport transport command: 'testlist'

    I guess this said what was wrong, but it wasn't clear to me that "Tainted" was an error and not a warning, and that it was what caused the failure. This became another rabbit hole: exim4 has made tainting a real pain to deal with, especially for the way I'm using local_part_data, which is still perfectly safe in my use case. Sadly exim4 decided that I cannot be trusted and forces an overly strict and, quite frankly, overbearing tainting system on me that just breaks my setup without providing any easy opt out.
    I'm honestly not happy with exim4 on that one, especially the complete lack of useful errors in the exim logs and the lack of documentation giving easy and actionable steps to get out of this hole.

    So now I'm many hours into figuring out how to fix exim4, and I'm really, really not impressed by how they forced that overbearing tainting mechanism on users with very little info on how to easily fix the things it broke that were working safely.
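
    Since the taint failure is easy to miss in a wall of debug output, something like the following Python sketch can pull it out. The regular expression is written against the one log line quoted in this post, so it is an assumption that other exim versions word the message the same way; the function name is mine.

```python
import re
from typing import Iterable, List, Tuple

# Matches e.g. "... T=mm21_transport: Tainted arg 2 for mm21_transport transport command: 'testlist'"
TAINT_RE = re.compile(
    r"T=(?P<transport>\S+): Tainted arg (?P<argno>\d+) for .*?: '(?P<value>[^']*)'"
)

def find_tainted(lines: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (transport, argument number, tainted value) for each taint failure."""
    hits = []
    for line in lines:
        match = TAINT_RE.search(line)
        if match:
            hits.append((match["transport"], int(match["argno"]), match["value"]))
    return hits

log = [
    "mm21_transport transport returned FAIL for testlist@lists.merlins.org",
    "** testlist@lists.merlins.org F=<root@merlins.org> R=mm21_main_director "
    "T=mm21_transport: Tainted arg 2 for mm21_transport transport command: 'testlist'",
]
print(find_tainted(log))
```

    Any hit means the delivery was refused, not merely warned about.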

    So, exim4 took much longer to fix than it should have, here's a new page on it: Part #2 was unfortunately much more painful in an unnecessary way due to a poorly made forced API change in exim4

    π 2025-10-23 01:01 in Computers, Linux, Public
    After 25 years of running on donated hardware, magic.merlins.org aka marc.merlins.org aka ledtranceguy.org finally migrated to a server I built from scratch, for cheap, that is about 60 times more power efficient than the previous server (Dell Poweredge 2950). The Dell was almost 3 times slower since the hardware dated from 2006, and took more than 20 times more power (including the spinning rust drives).

    The more Raspberry Pi specific posts are here:

  • Using Raspberry Pi5 as a Server With Raid1, Btrfs, and Multiple NVME M2 or Sata Drives
  • Using a Raspberry Pi 5 (Rpi5) as a Server With Btrfs, Raid1, Serial Console and Dual NVME/SD Card Recovery Boot
    Before you see the non professional looking mess of wires I built with 2 rPi5 and reclaimed/recycled drives (I only bought 2 new 2TB NVME for boot, as I want those flash drives to work a long time), I should mention that I considered another Dell server I had lying around at home, not even sure where from or why. Looking it up, it was a Dell DSS1510, which seems to be a cheaper version of the R430. It's a very professional looking server with redundant power and all, and I did consider it, especially since Dell seems to use capacitors that don't just die years later and take the motherboard down with them.


    room for 8 2.5 Sata flash drives plugged into an unknown raid card

    this shows the MB similar to R430 but with lots of stuff missing to save money

    Research showed it was a system from 2016, an upgrade from my existing 2006 server :) but at the same time, do I really want to "upgrade" again to a server that is almost 10 years old? The colo I'm in (via.net, now nextlevel) nicely asked me if I could use less power for the monthly rate they are giving me, and this server can still peak at 200W. Even if it only takes a bit more than 100W, my double rPi5 solution takes less than 30W, probably between 10 and 20W when idle, and that's for 2 computers, giving better high availability and failover.
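
    To put those wattages in perspective, here is a quick back-of-the-envelope Python sketch that turns a constant draw into kWh per year. It uses only the rough figures from the paragraph above and assumes the draw is constant around the clock, which real servers of course don't do.

```python
HOURS_PER_YEAR = 24 * 365

def kwh_per_year(watts: float) -> float:
    """Energy used in a year at a constant power draw."""
    return watts * HOURS_PER_YEAR / 1000

for label, watts in [
    ("DSS1510 at ~100W", 100),
    ("DSS1510 at 200W peak", 200),
    ("2x rPi5 at <30W", 30),
]:
    print(f"{label}: {kwh_per_year(watts):.0f} kWh/year")
```

    Even against the Dell's lower figure, the pair of rPi5 uses well under a third of the energy.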


    Good search said:

  • Single-Core Performance: The Raspberry Pi 5 and the Xeon E5-2620 v3 are remarkably close in single-core speed. The Pi 5's modern ARM architecture allows it to match the much older, higher-power Xeon core for single-threaded tasks. Both significantly outperform the ancient Xeon 5140 cores.
  • Multi-Core Performance: The Xeon E5-2620 v3 remains the leader due to its 12 threads. The Raspberry Pi 5 is second, still much faster than the dual Xeon 5140 setup.
  • Power Efficiency: The Raspberry Pi 5 maintains its huge advantage in efficiency, delivering similar single-core performance to the Xeon E5-2620 v3 while using vastly less power.
    With 2 rPi5 I'm actually faster than the DSS 1510 for maybe 1/10th of the power, so not a bad deal :)

    So here is the end result I built:

  • 2 rPi5 with 32GB pro sdcard that will never be used except for recovery (I don't trust sdcards for long term use)
  • each system is set up to boot from a 2TB NVME, a top of the line Samsung 990 Pro. This is the one place where I spent money since drives are almost always the weak link long term
  • magic, server #1, has a leftover 2TB Sata M2 plugged in via a USB3 adapter, which gives very high performance, although it's really just a backup device I can fail over to and boot from if the NVME were to die (and I can do all this remotely)
  • moremagic, server #2, has 2 1TB Sata drives I had lying around plugged into an M2 Sata controller, allowing 6 drives total (middle of picture below)

    The first thing I had to engineer was using each server as a serial console server for the other one, as explained on my Using a Raspberry Pi 5 (Rpi5) as a Server With Btrfs, Raid1, Serial Console and Dual NVME/SD Card Recovery Boot blog.
    The next thing was how to get 5V power for those sata drives. My first solution was just to steal it from the GPIO port:


    But I found a dual sata power cable I had laying around and a 3 pin female plug with the right plastic bits to make it almost impossible to plug backwards (which would likely destroy the drives):

    this

    to replace that


    The last relevant bit is to find those hard to find USB-C power supplies that give 5A on 5V (normally it's 3A max), although you could also get a real 5V power supply and feed the rPi through the GPIO pins, but that would bypass some protections. In the end, my very professional setup that did take many days to build and test, looked like this:


    oops, forgot to protect the back so it doesn't short when touching metal, duct tape to the rescue

    the new setup on top of the existing poweredge server running for a while as recovery/emergency

    And for shits and giggles, still found an original VA Linux server going strong, as a rack spacer :)



    Power Cycling

    Since the rPi5 sadly doesn't have full firmware support over serial (output only, no input to select the boot menu or do anything, really), expecting any kind of BMC functionality like power cycling is of course overly optimistic. To make up for this, I ended up adding a power outlet switched by a 3.3V controllable relay that moremagic can toggle via GPIO (so basically moremagic can power cycle magic if it's truly hosed):
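
    The power cycle logic itself is trivial. Here is a hedged Python sketch with the GPIO part abstracted away: set_outlet is a hypothetical callable that drives the relay (on real hardware it would wrap a GPIO library such as gpiozero), and the pin number, relay polarity and 10 second off time are assumptions that depend on the wiring, not something taken from my actual setup.

```python
import time
from typing import Callable

def power_cycle(set_outlet: Callable[[bool], None],
                off_seconds: float = 10.0,
                sleep: Callable[[float], None] = time.sleep) -> None:
    """Cut power, wait long enough for the board to fully drain, then restore it."""
    set_outlet(False)
    sleep(off_seconds)
    set_outlet(True)

# Dry run that records what would happen instead of touching any hardware.
events = []
power_cycle(
    lambda on: events.append("on" if on else "off"),
    off_seconds=10,
    sleep=lambda seconds: events.append(f"wait {seconds}s"),
)
print(events)
```

    Passing the relay driver and the sleep function in as parameters keeps the sequence testable on a machine that has no GPIO at all.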


    Moremagic is back!

    I had magic and moremagic for many years (if you know the significance of those names, you are an ubergeek and you can Email me to brag, it's well deserved). Moremagic however died in Sept 2024, so I was running with no backup server for over a year, which was not good given that I'm not always home and could have suffered serious downtime if magic had died.

    Now I'm back with 2 servers, on the same network which is not ideal, but they are both redundant filesystem-wise and capable of taking over one another's duties if one were to die (likely the power supply I assume).

    Further reading

  • rescuing/rebuilding magic, and magic back online and live
  • Moremagic v1 died after 18 years of service
  • Magic v3 died, upgrade to V4, Dell Poweredge 2950 and 64bit linux!
  • Magic v5: From Dell Poweredge 2950 to Raspberry Pi 5 (skipping Dell DSS1510)
  • Finishing Upgrade of Year 2000 Linux System From i386 to amd64 to arm64 for Raspberry Pi5 with mailman 2.1.7 for Python 2 (the last 5% that took 70% of the time)
  • Exim4 Mailman2 allow insecure tainted data local parts and local part data (what sadly made this migration a lot less fun around the end)

