Thursday, August 16, 2012

Solaris 10; SVM mirrors. Maintenance, last-erred


A server that had been sitting unused is about to be put back into service. We wanted to check everything out on it first, and this was the status of its mirrors. I brought the box down to single-user mode to investigate further.

 # metastat -c  
 d106       m  22GB d16 d26 (resync-66%)  
   d16     s  22GB c1t0d0s6  
   d26     s  22GB c1t1d0s6 (resyncing)  
 d9        m  52GB d19 (maint) d29 (maint)  
   d19     s  52GB c1t2d0s6  
   d29     s  52GB c1t3d0s6  
 d8        m  16GB d18 (maint) d28 (maint)  
   d18     s  16GB c1t2d0s1  
   d28     s  16GB c1t3d0s1  
 d4        m  10GB d14 d24 (maint)  
   d14     s  10GB c1t0d0s4  
   d24     s  10GB c1t1d0s4 (maint)  
 d0        m  20GB d20 (maint) d10 (maint)  
   d20     s  20GB c1t1d0s0 (resyncing)  
   d10     s  20GB c1t0d0s0 (last-erred)  
 d1        m  16GB d11 d21  
   d11     s  16GB c1t0d0s1  
   d21     s  16GB c1t1d0s1  

Right before I captured this, I had already run the following in single-user mode, which is why d106 shows as resyncing:

 # metareplace -e d106 c1t1d0s6  

I also ran the following commands, waiting for each resync to finish before starting the next:

 # metareplace -e d4 c1t1d0s4  
 d4: device c1t1d0s4 is enabled  
 # metasync d8  
 # metasync d9  

I wanted to get the other mirrors out of maintenance in case I had to replace a disk. That left only mirror d0. iostat -E shows soft and hard errors on sd1, which is the disk behind d10.

 # iostat -E  
 sd1    Soft Errors: 6 Hard Errors: 1131 Transport Errors: 0  
 Vendor: FUJITSU Product: MAY2073RCSUN72G Revision: 0401 Serial No: 6643S13H5W  
 Size: 73.41GB <73407865856 bytes>  
 Media Error: 969 Device Not Ready: 0 No Device: 162 Recoverable: 6  
 Illegal Request: 1 Predictive Failure Analysis: 0  
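On a box with more disks it can help to filter this output down to the devices that actually report hard errors. The following is only a sketch: `disks_with_hard_errors` is a hypothetical name, and it assumes the summary line has the same layout as the one shown above (device name, then `Soft Errors: N Hard Errors: N Transport Errors: N`).

```shell
# Read `iostat -E` output on stdin and print "device hard_error_count"
# for every device whose hard error count is greater than zero.
disks_with_hard_errors() {
  awk '
    # Summary line: <dev> Soft Errors: <n> Hard Errors: <n> Transport Errors: <n>
    $2 == "Soft" && $3 == "Errors:" && $5 == "Hard" {
      if ($7 + 0 > 0) print $1, $7
    }
  '
}
```

It would be used as `iostat -E | disks_with_hard_errors`.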

Metastat of the mirror itself:

 # metastat d0  
 d0: Mirror  
   Submirror 0: d20  
    State: Needs maintenance  
   Submirror 1: d10  
    State: Needs maintenance  
   Pass: 1  
   Read option: roundrobin (default)  
   Write option: parallel (default)  
   Size: 41945472 blocks (20 GB)  
 d20: Submirror of d0  
   State: Needs maintenance  
   Invoke: metasync d0  
   Size: 41945472 blocks (20 GB)  
   Stripe 0:  
     Device   Start Block Dbase    State Reloc Hot Spare  
     c1t1d0s0     0   No    Resyncing  Yes  
 d10: Submirror of d0  
   State: Needs maintenance  
   Invoke: after replacing "Maintenance" components:  
         metareplace d0 c1t0d0s0 <new device>  
   Size: 41945472 blocks (20 GB)  
   Stripe 0:  
     Device   Start Block Dbase    State Reloc Hot Spare  
     c1t0d0s0     0   No   Last Erred  Yes  

Submirror d10 is 'Last Erred' because it held the last valid copy of the volume before it needed maintenance. Submirror d20 is stuck trying to resync but cannot complete, presumably because it has to read from the failing d10. My only choice was to attempt to fix the errors on the disk. I went through /var/adm/messages and found errors from disk c1t0d0, but only on the root slice, s0, which belongs to the d0 mirror.

I opted to try an analyze read on just that section of the disk. First, to get the sector range of the slice:

 # prtvtoc /dev/rdsk/c1t0d0s2  
 * /dev/rdsk/c1t0d0s2 partition map  
 *  
 * Dimensions:  
 *   512 bytes/sector  
 *   424 sectors/track  
 *   24 tracks/cylinder  
 *  10176 sectors/cylinder  
 *  14089 cylinders  
 *  14087 accessible cylinders  
 *  
 * Flags:  
 *  1: unmountable  
 * 10: read-only  
 *  
 * Unallocated space:  
 *    First   Sector  Last  
 *    Sector   Count  Sector  
 *      0 33560448 33560447  
 *  
 *                        First    Sector      Last  
 * Partition Tag Flags   Sector     Count    Sector Mount Directory  
     0        2   00   33560448  41945472  75505919  
     1        3   01          0  33560448  33560447  
     2        5   00          0 143349312 143349311  
     3        0   00   75505920     91584  75597503  
     4        7   00   75597504  20972736  96570239  
     6        0   00   96570240  46687488 143257727  
     7        0   00  143257728     91584 143349311  
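The start and end blocks for a slice can also be pulled out of the prtvtoc output instead of being copied by eye. This is a minimal sketch under the assumption that the columns are laid out as above (partition, tag, flags, first sector, sector count, last sector); `slice_range` is a name invented for this example. The last sector is recomputed as first sector + sector count - 1, which for slice 0 gives 33560448 + 41945472 - 1 = 75505919, matching the table. The `-v` option needs `nawk` or `/usr/xpg4/bin/awk` on Solaris 10.

```shell
# Read prtvtoc output on stdin and print "first_sector last_sector"
# for the slice number given as the first argument.
slice_range() {
  awk -v slice="$1" '
    $1 ~ /^\*/ { next }            # skip the comment/header lines
    $1 == slice {
      last = $4 + $5 - 1           # first sector + sector count - 1
      printf "%d %d\n", $4, last
      found = 1
    }
    END { if (!found) exit 1 }     # non-zero exit if the slice is absent
  '
}
```

For example, `prtvtoc /dev/rdsk/c1t0d0s2 | slice_range 0` should print the two numbers to enter as the starting and ending blocks in format's analyze setup.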

Next, I went into format, selected the disk and chose analyze. Using the sector information above for slice 0, I entered the following:

 analyze> setup  
 Analyze entire disk[yes]? no  
 Enter starting block number[0, 0/0/0]: 33560448  
 Enter ending block number[143349311, 14086/23/423]: 75505919  
 Loop continuously[no]? no  
 Enter number of passes[2]:  
 Repair defective blocks[yes]?  
 Stop after first error[no]?  
 Use random bit patterns[no]?  
 Enter number of blocks per transfer[126]: 1  
 Verify media after formatting[yes]?  
 Enable extended messages[no]?  
 Restore defect list[yes]?  
 Restore disk label[yes]?  
 analyze> read  
 Ready to analyze (won't harm SunOS). This takes a long time,  
 but is interruptable with CTRL-C. Continue? y  

And here I sit, waiting for this analyze to finish.

1 comment:

  1. Stumbled upon this post. I am in the exact same situation. Can you please post the results and what you did? This would help me immensely.
