RAID-5 misc [Jan. 27th, 2007|03:11 pm]
Brad Fitzpatrick

I never use RAID-5, so I'd never noticed this before:
   -f, --force
        Insist  that  mdadm  accept  the  geometry and layout
        specified without question.  Normally mdadm will  not
        allow  creation of an array with only one device, and
        will try to create a raid5  array  with  one  missing
        drive (as this makes the initial resync work faster).
        With --force, mdadm will not try to be so clever.

And indeed, when I created the array with 5 disks, it marked one as a spare:
# mdadm --detail /dev/md1
        Version : 00.90.03
  Creation Time : Sat Jan 27 13:30:36 2007
     Raid Level : raid5
     Array Size : 1953545984 (1863.05 GiB 2000.43 GB)
    Device Size : 488386496 (465.76 GiB 500.11 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Sat Jan 27 13:52:08 2007
          State : clean, degraded, recovering
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 27% complete

           UUID : 5ad3ba82:30b256f3:c70f55c8:1f40abbd
         Events : 0.194

    Number   Major   Minor   RaidDevice State
       0       8       48        0      active sync   /dev/sdd
       1       8       64        1      active sync   /dev/sde
       2       8       80        2      active sync   /dev/sdf
       3       8       96        3      active sync   /dev/sdg
       5       8      112        4      spare rebuilding   /dev/sdh

And you can see that 4 disks are reading, and 1 is writing:
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdd             198.00     45824.00         0.00      45824          0
sde             198.00     45824.00         0.00      45824          0
sdf             200.00     45824.00         0.00      45824          0
sdg             203.00     46080.00         0.00      46080          0
sdh             203.00         0.00     46080.00          0      46080


It makes sense that it's done this way: 4 disks doing sequential reads and 1 doing sequential writes is faster than 5 disks all doing mixed reads and writes.

But really? 6 hours?

I'd prefer an option where all disks are zeroed first and the initial resync is skipped. Yes, the array wouldn't be immediately usable the way it is when the kernel does the background sync for me, but I think I could zero a 460 GB disk quicker than 6 hours: based on the 100 MB/s filesystem writes I saw, and assuming I can do even better to a raw block device, it should be about an hour. But I can't see how to do it. Is --assume-clean what I'm looking for? Do I just zero all the devices myself first, then create the array with that flag?
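A sketch of what that might look like (device names are examples from the output above; whether zeroed members plus --assume-clean actually yields consistent parity is an assumption to verify, and this destroys all data on the listed devices):

```shell
# Zero every member device in parallel (DESTROYS all data on them).
for dev in /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh; do
    dd if=/dev/zero of="$dev" bs=1M &
done
wait

# --assume-clean tells mdadm the members are already in sync,
# so the initial resync is skipped entirely.
mdadm --create /dev/md1 --level=5 --raid-devices=5 --assume-clean \
      /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
```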

I wouldn't normally mind, but I want to performance-test several configurations, and 6-hour waits seriously kill my flow. :)

From: edm
2007-01-28 12:10 am (UTC)


If you want to performance-test configurations, why not define, eg, a 2GB or 10GB partition at the start of each disk, then build those into a RAID set? The rebuild will be much faster, and you'll get through your RAID-set layout tests much quicker. (Of course, if you want to fill the RAID set with 400GB of data this won't help -- but filling a disk with 400GB of data takes Some Time (tm) too. And the start of the disk has faster access than the end of the disk -- but if you care about this, try partitions defined at different points on the disk.)
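For instance (a sketch; the device names, partition scheme, and 10 GB size are illustrative choices, and this repartitions the disks destructively):

```shell
# Create a small 10 GB partition at the start of each disk, then build
# the test array from the partitions rather than the whole disks.
for dev in /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh; do
    parted -s "$dev" mklabel msdos mkpart primary 0 10GB
done

mdadm --create /dev/md1 --level=5 --raid-devices=5 \
      /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
```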

For RAID-5 I'm not sure that you can just zero the disks and do --assume-clean, because I'm not certain that the parity disk ends up with all-zeros on its blocks (I don't remember the parity algorithm used off the top of my head). Doing --assume-clean and then zeroing the whole RAID set should, in theory, work, but I can't see it being a whole lot faster than letting MD resync, as you're still doing RAID-5 parity calculations and writing to 5 disks at 45MB/s. The limiting factor here is the 45MB/s to each individual disk platter.


PS: I normally make my software RAID sets on, eg, 32GB or 64GB partitions, and then use something like LVM to join them together again. I do this precisely to keep the resync time for any given RAID set down to, eg, 1 hour.
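The layout described in the PS might look something like this (a hedged sketch; the array names, volume-group name, and sizes are made up for illustration):

```shell
# Several small RAID-5 sets built on matching partitions of the same
# five disks -- each one resyncs independently and quickly.
mdadm --create /dev/md1 --level=5 --raid-devices=5 /dev/sd[defgh]1
mdadm --create /dev/md2 --level=5 --raid-devices=5 /dev/sd[defgh]2

# ...then joined back together with LVM into one big logical volume.
pvcreate /dev/md1 /dev/md2
vgcreate bigvg /dev/md1 /dev/md2
lvcreate -l 100%FREE -n data bigvg
```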
From: edm
2007-01-28 12:20 am (UTC)


Incidentally, ((460GB * 1024) / 45 MBps) / 3600 = just under 3 hours. So the absolute best case to write out an entire disk is about 3 hours. Thus 6 hours to write it all seems a bit long, but not unbelievable. As I said, there's a reason that I do my RAID sets in smaller chunks than "whole disk".
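That back-of-the-envelope number can be checked directly:

```shell
# Best-case hours to sequentially write a whole 460 GB disk at 45 MB/s:
# (460 GB * 1024 MB/GB) / (45 MB/s) / (3600 s/hour)
awk 'BEGIN { printf "%.1f\n", (460 * 1024) / 45 / 3600 }'
# prints 2.9
```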

From: brad
2007-01-28 01:40 am (UTC)


Yeah, I'd done my math wrong. My original 100 MB/s number was from a 4-stripe LV, so I probably don't get anywhere near 45 MB/s; more like 25 MB/s.

6 hours starts to make sense. :)
From: edm
2007-01-28 02:08 am (UTC)


I was basing the 45MB/s figure on the Blk_read and Blk_write figures in your output (about 45,000 per second per disk). And 45MB/s is definitely the right order of magnitude for a modern disk platter, which was part of why I took that figure without much extra consideration.

However, the iostat man page suggests that the blocks reported are actually sectors on Linux 2.4 and later kernels, and thus 512 bytes each. Given that figure -- which calculates out to about 22.5MB/s -- it translates pretty directly to 6 hours to resync with the same calculation as I used previously.
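As a sanity check on that conversion, using the ~45,824 blocks/s per disk from the iostat output above:

```shell
# 45824 sectors/s at 512 bytes each, expressed in MiB/s
awk 'BEGIN { printf "%.1f\n", 45824 * 512 / 1048576 }'
# prints 22.4
```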

Although I'd be wondering why you're getting only 22.5MB/s off your disks; that seems a bit low for a modern SCSI- or even SATA-connected disk.

From: brad
2007-01-28 02:20 am (UTC)


They're SATA with NCQ enabled. But the Linux rebuild code uses only "idle disk bandwidth", so I wonder if it's only submitting one I/O at a time to be nice. And small I/Os at that: 64kB stripe size. So lots of little 64 kB I/Os submitted one at a time is not as good as submitting either huge ones, or lots of small ones in parallel?


I'll do tests later on the raw devices.
From: edm
2007-01-28 02:40 am (UTC)


The "idle" bandwidth code will use up to all the disk bandwidth -- or the bandwidth specified in the dev.raid.speed_limit_max sysctl if it's lower (the default maximum seems to be 200MB/s) -- assuming there is no other I/O going on. But any other I/O will take precedence. OTOH, as you say, 64kB is somewhat suboptimal as a read/write chunk on a modern disk, and the resync code may be reading in only one stripe at a time and writing it out, limiting the throughput (and, eg, not getting any advantage from NCQ).
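The throttle mentioned above can be inspected and adjusted at runtime; values are in KB/s, and the 10000 figure below is just an example cap:

```shell
# Show the current md resync speed limits (KB/s)
sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# e.g. cap background resync at ~10 MB/s so it doesn't swamp other I/O
sysctl -w dev.raid.speed_limit_max=10000
```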

These days it's probably worth specifying a larger chunk size than the default 64kB; I suspect 128kB or 256kB matches modern disks' ability to stream into their on-disk cache a bit better. (Alas, this can only be specified when you create the array.)
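Since chunk size is fixed at creation time, it would be set like this (a sketch; 256 KB chosen per the suggestion above, device names are examples):

```shell
# --chunk is in kilobytes and cannot be changed after creation
mdadm --create /dev/md1 --level=5 --raid-devices=5 --chunk=256 \
      /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
```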

From: jeffr_tech
2007-01-29 07:11 am (UTC)


Recent SATA drives have 8 or 16 MB caches. If the disks are otherwise idle, 64k transactions are more than enough to eat up all of the drive's bandwidth, as long as the latency between them is low and the write cache is enabled. If you have a battery backup, or don't care about your data integrity, you can force the write cache on.

I just bought a SATA RAID card for my box. It must store in its configuration which blocks contain valid data, because it takes no time to initialize the array. For an extra $80 I bought a battery backup for the RAID card, and it has 128 MB of write cache. For many datasets you're limited by PCI-X bandwidth, not disk throughput. :)
From: jeffr_tech
2007-01-29 07:27 am (UTC)


read 64m
10# dd if=/dev/ad0s1b of=/dev/null bs=64k count=1024
67108864 bytes transferred in 1.688534 secs (39743863 bytes/sec)
write 32m
10# dd if=/dev/zero of=/dev/ad0s1b bs=64k count=512
33554432 bytes transferred in 0.662472 secs (50650339 bytes/sec)
write 64m
10# dd if=/dev/zero of=/dev/ad0s1b bs=64k count=1024
67108864 bytes transferred in 1.498063 secs (44797095 bytes/sec)
write 128m
10# dd if=/dev/zero of=/dev/ad0s1b bs=64k count=2048
134217728 bytes transferred in 3.607651 secs (37203634 bytes/sec)

These are the stats for the pata drive in my laptop. It's 7200 rpm and has write cache enabled. This is doing real 64k transactions to the drive. I was writing to an offset at about 512mb from the start of the disk on a mostly idle system.

Notice that as the size increases on writes, the throughput decreases. This is because the writes are hitting the drive cache, and eventually we get to the real write speed of the drive, which is apparently about 37 MB/s. Not bad for a laptop.

The fastest rate I was able to get was about 66 MB/s, which is about 50% of the theoretical maximum bandwidth for a 32-bit, 33 MHz PCI bus -- a bus which probably really maxes out at about 70% of 32 bits x 33 MHz. That's really not too bad, although I'd be curious to know why we didn't get that last 20%.
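The bus arithmetic works out as: 4 bytes per transfer at 33 MHz is about 132 MB/s, so 66 MB/s is right at the 50% mark:

```shell
# Theoretical 32-bit/33 MHz PCI bandwidth in MB/s, and the fraction
# of it that 66 MB/s represents
awk 'BEGIN { printf "%.0f %.2f\n", 4 * 33, 66 / (4 * 33) }'
# prints 132 0.50
```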

I'll have to check out my RAID array when I get home, although the performance from dd will be misleading, as it sends a single I/O at a time to the device. For reads you'll see one drive's performance, but for writes we might be able to measure the bus speed.
From: ydna
2007-01-28 02:04 am (UTC)


I like your idea of keeping 32GB or 64GB partitions for the sake of faster resync time. But if you blow a 400+GB disk with six or seven 64GB partitions, don't you have to rebuild all the RAID sets that intersect with that drive anyway?
From: edm
2007-01-28 02:16 am (UTC)


Sure, you need to rebuild everything that touches the disk. But you can do it in sections (eg, if you shut down or reboot the box in the middle of the whole-disk rebuild you don't have to start from scratch), and if you have a failure on a group of sectors there's a good chance they're isolated in one of the RAID-set chunks, so the rest still run at full performance rather than degraded performance.

I was also encouraged in doing this by one client system with a partly supported Promise SATA controller which would crash from time to time, particularly under the load of resyncing. With this strategy, plus limiting the bandwidth used to resync, it usually survives a resync -- and where it doesn't, it can usually be done in stages.

From: ydna
2007-01-28 02:21 am (UTC)


Ah, okay. I like the idea even more. Thanks, Ewen.

Yeah, Promise cards... the only promise I got from them was to corrupt my arrays. Heh.
From: loganb
2007-01-28 04:56 am (UTC)


When you say "You can do it in sections..." do you mean "you, the operator" or "you, the intelligent kernel"? By default, will it try to rebuild all the RAID sets at once, or will it recognize that multiple ones intersect the same physical drive and only rebuild one at a time?
From: edm
2007-01-28 05:16 am (UTC)


It's automatic.

The kernel recognises which RAID sets use overlapping resources (eg, drives), and avoids rebuilding ones which require the same resource at the same time. However it'll rebuild RAID sets using non-overlapping sets of resources in parallel. So, eg, if you have RAID-1 sets on sda1 and sdb1, and another on sdc1 and sdd1 then they'll both be rebuilt in parallel. But if you have RAID-5 sets on sda5, sdb5, sdc5, sdd5 and another one on sda6, sdb6, sdc6, and sdd6, then only one of them will be rebuilt at a time (and neither will be rebuilt while the earlier ones are being rebuilt).
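This scheduling is visible from userspace: while one array on a shared disk rebuilds, the others waiting on it report their resync as delayed.

```shell
# Arrays waiting on a shared member show "resync=DELAYED" here while
# another array on the same disk is actively rebuilding.
cat /proc/mdstat
```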

The particularly useful thing from the point of view of rebooting is that if you notice that it's finished two of the RAID sets, and you need to reboot again (eg, another power outage needed, or hardware swap or something) then you can do so, and when it comes up it'll start on the RAID sets which remain rather than doing the first two again. If it's all in one gigantic RAID set, it starts again from the beginning when you reboot, which is rather painful if it takes, eg, 6 hours to resync it all with no other I/O on the system...

From: askbjoernhansen
2007-01-30 05:12 pm (UTC)

RAID 5 ?!

Don't use RAID 5. A disk will fail and while it rebuilds a second disk will fail, too.

If you have more than 4-5 disks, use RAID6. If you have just 4 disks, stick with RAID10.

I also second the suggestion of splitting the disks up into smaller chunks. Beware, though, that the Linux SATA (or SCSI, I forget) layer only likes ~15-16 partitions per disk. I've started making each of my RAID chunks 80-100GB; that way, when disks are 1TB+, I won't have to merge the old partitions.

- ask