Replacing A Failed Hard Drive In A Software RAID Array

Method 1:

1 Preliminary Note

In this example I have two hard drives, /dev/sda and /dev/sdb, with the partitions /dev/sda1 and /dev/sda2 as well as /dev/sdb1 and /dev/sdb2.

/dev/sda1 and /dev/sdb1 make up the RAID1 array /dev/md0.

/dev/sda2 and /dev/sdb2 make up the RAID1 array /dev/md1.

/dev/sda1 + /dev/sdb1 = /dev/md0

/dev/sda2 + /dev/sdb2 = /dev/md1

/dev/sdb has failed, and we want to replace it.

2 How Do I Tell If A Hard Disk Has Failed?

If a disk has failed, you will probably find a lot of error messages in the log files, e.g. /var/log/messages or /var/log/syslog.

You can also run

cat /proc/mdstat

and instead of the string [UU] you will see [U_] if you have a degraded RAID1 array.
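If you would rather run this check from a script (for example a cron job), the [U_] test can be automated with a little awk. This is a minimal sketch and not part of the original procedure. The function name degraded_arrays is hypothetical. It assumes the usual /proc/mdstat layout, where status brackets such as [UU] or [U_] appear on the line after each "mdN : ..." line.

```shell
#!/bin/sh
# degraded_arrays (hypothetical helper): read /proc/mdstat-style text on stdin
# and print the name of every array whose status field, e.g. [U_] or [UUU_],
# contains an underscore, i.e. at least one missing or failed member.
degraded_arrays() {
  awk '
    # An array header looks like "md0 : active raid1 sda1[0] sdb1[1]".
    /^md[0-9]+ *:/ { dev = $1; next }
    # The next line carrying a run of U/_ in brackets is the status of that array.
    dev != "" && match($0, /\[[U_]+\]/) {
      status = substr($0, RSTART, RLENGTH)
      if (index(status, "_") > 0) print dev
      dev = ""
    }
  '
}
```

Running degraded_arrays < /proc/mdstat prints nothing while every array shows only U's. It prints md0 for a state like the one shown in the next section after /dev/sdb1 has been marked as failed. Inactive arrays have no status brackets, so they are not reported.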

3 Removing The Failed Disk

To remove /dev/sdb, we will mark /dev/sdb1 and /dev/sdb2 as failed and remove them from their respective RAID arrays (/dev/md0 and /dev/md1).

First we mark /dev/sdb1 as failed:

mdadm --manage /dev/md0 --fail /dev/sdb1

The output of

cat /proc/mdstat

should look like this:

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[2](F)
24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1]
24418688 blocks [2/2] [UU]

unused devices: <none>

Then we remove /dev/sdb1 from /dev/md0:

mdadm --manage /dev/md0 --remove /dev/sdb1

The output should look like this:

server1:~# mdadm --manage /dev/md0 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1

And

cat /proc/mdstat

should show this:

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0]
24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1]
24418688 blocks [2/2] [UU]

unused devices: <none>

Now we do the same steps again for /dev/sdb2 (which is part of /dev/md1):

mdadm --manage /dev/md1 --fail /dev/sdb2

cat /proc/mdstat

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0]
24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[2](F)
24418688 blocks [2/1] [U_]

unused devices: <none>

mdadm --manage /dev/md1 --remove /dev/sdb2

server1:~# mdadm --manage /dev/md1 --remove /dev/sdb2
mdadm: hot removed /dev/sdb2

cat /proc/mdstat

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0]
24418688 blocks [2/1] [U_]

md1 : active raid1 sda2[0]
24418688 blocks [2/1] [U_]

unused devices: <none>
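If you have to repeat this for several arrays, you can wrap the fail and remove steps in a small function. This is a sketch under stated assumptions. The name retire_member and the DRY_RUN switch are hypothetical. The function issues the same mdadm --manage ... --fail and --remove calls used above, and it stops if either call fails.

```shell
#!/bin/sh
# retire_member (hypothetical helper): mark one partition as failed in its
# array, then hot-remove it, using the same two mdadm calls as above.
# With DRY_RUN=1 the commands are only printed, which lets you check the
# array/partition pairs before touching a live system.
retire_member() {
  array=$1   # e.g. /dev/md0
  part=$2    # e.g. /dev/sdb1
  for op in --fail --remove; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "mdadm --manage $array $op $part"
    else
      mdadm --manage "$array" "$op" "$part" || return 1
    fi
  done
}
```

For the layout in this example, run retire_member /dev/md0 /dev/sdb1 and retire_member /dev/md1 /dev/sdb2 as root. It is a good idea to run each call once with DRY_RUN=1 in front of it, to review the commands before they execute.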

Then power down the system:

shutdown -h now

and replace the old /dev/sdb hard drive with a new one. The new drive must be at least as large as the old one; if it's even a few MB smaller, rebuilding the arrays will fail.
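Drive labels round capacities, so it is safer to compare exact byte counts. The partition table of /dev/sda is copied onto the new disk in the next step, so a practical check is to compare the new disk against the surviving /dev/sda. The function below is a minimal hypothetical helper (disk_fits is not a standard tool). You can obtain the byte counts with blockdev --getsize64 from util-linux.

```shell
#!/bin/sh
# disk_fits (hypothetical helper): compare two sizes given in bytes.
# $1 = size of the surviving disk (the source of the partition table),
# $2 = size of the replacement disk.
# Prints "ok" and succeeds if the new disk is at least as large;
# otherwise prints the shortfall and returns 1.
disk_fits() {
  ref_bytes=$1
  new_bytes=$2
  if [ "$new_bytes" -ge "$ref_bytes" ]; then
    echo "ok"
  else
    echo "too small by $((ref_bytes - new_bytes)) bytes"
    return 1
  fi
}
```

For example, after booting with the new disk and before copying the partition table, run: disk_fits "$(blockdev --getsize64 /dev/sda)" "$(blockdev --getsize64 /dev/sdb)"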

4 Adding The New Hard Disk

After you have changed the hard disk /dev/sdb, boot the system.

The first thing we must do now is to create the exact same partitioning as on /dev/sda. We can do this with one simple command:

sfdisk -d /dev/sda | sfdisk /dev/sdb

You can run

fdisk -l

to check if both hard drives have the same partitioning now.
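The fdisk -l output has to be compared by eye. As an alternative, you can extract the start and size of every partition from the sfdisk -d dumps of both disks and compare them with diff. This is a hedged sketch. partition_geometry is a hypothetical helper name, and it assumes the usual sfdisk -d line format "/dev/sda1 : start=..., size=..., type=...". Device names and identifiers are deliberately ignored, because only the geometry has to match.

```shell
#!/bin/sh
# partition_geometry (hypothetical helper): read an "sfdisk -d" dump on stdin
# and print only the start and size of each partition. Device names, label
# ids and partition types are dropped, so dumps of two disks compare directly.
partition_geometry() {
  sed -n 's|^/dev/[^ ]* *: *\(start= *[0-9]*, *size= *[0-9]*\).*|\1|p'
}
```

In bash, which supports <( ) process substitution, you can run: diff <(sfdisk -d /dev/sda | partition_geometry) <(sfdisk -d /dev/sdb | partition_geometry) && echo "partition layouts match"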

Next we add /dev/sdb1 to /dev/md0 and /dev/sdb2 to /dev/md1:

mdadm --manage /dev/md0 --add /dev/sdb1

server1:~# mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: re-added /dev/sdb1

mdadm --manage /dev/md1 --add /dev/sdb2

server1:~# mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: re-added /dev/sdb2

Now both arrays (/dev/md0 and /dev/md1) will be synchronized. Run

cat /proc/mdstat

to see when it's finished.

During the synchronization the output will look like this:

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
24418688 blocks [2/1] [U_]
[=>...................] recovery = 9.9% (2423168/24418688) finish=2.8min speed=127535K/sec

md1 : active raid1 sda2[0] sdb2[1]
24418688 blocks [2/1] [U_]
[=>...................] recovery = 6.4% (1572096/24418688) finish=1.9min speed=196512K/sec

unused devices: <none>

When the synchronization is finished, the output will look like this:

server1:~# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid5] [raid4] [raid6] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
24418688 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
24418688 blocks [2/2] [UU]

unused devices: <none>
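Instead of re-running cat /proc/mdstat by hand, a script can poll it until no rebuild remains. This is a minimal sketch, assuming the recovery and resync lines look like the ones above. resync_running is a hypothetical name.

```shell
#!/bin/sh
# resync_running (hypothetical helper): succeed while /proc/mdstat-style text
# on stdin still shows a recovery, resync or reshape line, including queued
# ones such as "resync=DELAYED"; fail once none is left.
resync_running() {
  grep -Eq '(recovery|resync|reshape) *='
}
```

For example: while resync_running < /proc/mdstat; do sleep 60; done. mdadm also provides a --wait (-W) option that blocks until any resync, recovery or reshape on the named arrays has finished, e.g. mdadm --wait /dev/md0 /dev/md1.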

That's it, you have successfully replaced /dev/sdb!

Method 2:

After opening the terminal app, I switched to the superuser account with the “su” command. Once I had root-level access, I ran the following command:

> cat /proc/mdstat

This gave me the following output:

Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md1 : inactive sdh1[1](S) sdg1[0](S) sdi1[2](S)
 2930287296 blocks

md2 : active raid10 sdc1[2] sdd1[3] sda1[0] sdb1[1]
 1953524864 blocks 64K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>

It shows that my RAID array “md1” has only 3 drives attached, all flagged (S), and that it is inactive.
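The same reading can be done mechanically. The sketch below (array_summary is a hypothetical name) prints each array's name, its state, and the number of member devices listed on its mdN line. It counts spares marked (S) and failed members marked (F) as well. It assumes members appear as name[index], as in the output above.

```shell
#!/bin/sh
# array_summary (hypothetical helper): for each "mdN : state ..." line in
# /proc/mdstat-style text on stdin, print the array name, its state
# (active/inactive) and how many member devices are listed, counting
# spares marked (S) and failed members marked (F) as well.
array_summary() {
  awk '/^md[0-9]+ *:/ {
    n = 0
    for (i = 4; i <= NF; i++) if ($i ~ /\[[0-9]+\]/) n++
    print $1, $3, n
  }'
}
```

For the output above, it prints "md1 inactive 3" and "md2 active 4".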

To be on the safe side, I stopped the RAID array:

> mdadm --manage --stop /dev/md1

Terminal output of the above command is:

mdadm: stopped /dev/md1

To start the array, I used this command:

> mdadm --assemble /dev/md1

Terminal output of the above command is:

mdadm: /dev/md1 has been started with 3 drives (out of 4).

Finally, I added the replacement drive's partition, /dev/sdj1, to the RAID array:

> mdadm /dev/md1 --manage --add /dev/sdj1

Terminal output of the above command is:

mdadm: added /dev/sdj1

Once the new drive has been added, the array recovers automatically by copying each block from its remaining mirror copy onto the new drive. You can follow the progress of the recovery with:

> watch -n 1 cat /proc/mdstat

The terminal window refreshes every second and shows the rebuild progress:

Every 1.0s: cat /proc/mdstat Tue Jun 5 23:29:27 2012

Personalities : [raid10] [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4]
md1 : active raid10 sdj1[4] sdg1[0] sdi1[2] sdh1[1]
 1953524864 blocks 64K chunks 2 near-copies [4/3] [UUU_]
 [====>................] recovery = 21.5% (210768064/976762432) finish=197.2min speed=64726K/sec

md2 : active raid10 sdc1[2] sdd1[3] sda1[0] sdb1[1]
 1953524864 blocks 64K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>
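To extract only the rebuild percentage from this output, for example for a log or a monitoring script, a short awk filter is enough. This is a sketch. recovery_progress is a hypothetical name, and it assumes each recovery or resync line follows its mdN line, as above.

```shell
#!/bin/sh
# recovery_progress (hypothetical helper): print "<array> <percent>" for each
# array in /proc/mdstat-style text on stdin that has a recovery or resync
# line, e.g. "md1 21.5%". Prints nothing when no rebuild is running.
recovery_progress() {
  awk '
    /^md[0-9]+ *:/ { dev = $1 }
    match($0, /(recovery|resync) *= *[0-9.]+%/) {
      s = substr($0, RSTART, RLENGTH)
      sub(/^[a-z]+ *= */, "", s)
      print dev, s
    }
  '
}
```

Given the output shown above, it prints "md1 21.5%".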

My RAID array took 4 hours to recover (replicate data to the new drive).