
Hardware Test running on Hummer



todd
05-19-2009, 02:09 PM
Greetings!

I wanted to let you all know that we are currently running an offline test on Hummer's hard disk, so we have taken Hummer offline. The test should take at least 1 hour.

Why are we running this test?

Since yesterday we have been seeing a strange issue on Hummer: the /home filesystem switches to read-only mode on its own, which prevents users from updating or changing any files.

If we reboot the server, the error goes away temporarily, but it comes back after a short time.
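
For anyone who wants to check whether a filesystem is currently mounted read-only, here is a minimal Python sketch. It assumes a Linux system where each line of /proc/mounts lists the mount options in its fourth field. The function name, sample data and device names below are made up for illustration and are not taken from Hummer.

```python
# Hypothetical sample in /proc/mounts format (device, mount point, type, options, ...).
SAMPLE_MOUNTS = """\
/dev/sda3 / ext3 rw,relatime 0 0
/dev/sda5 /home ext3 ro,relatime 0 0
/dev/sda2 /tmp ext3 rw,noexec,nosuid,errors=remount-ro 0 0
"""


def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains the 'ro' flag."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # Compare whole options so 'errors=remount-ro' is not mistaken for 'ro'.
        if "ro" in options:
            result.append(mount_point)
    return result


print(read_only_mounts(SAMPLE_MOUNTS))  # prints ['/home']
```

On a live Linux server the same function could be given the contents of /proc/mounts instead of the sample text.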

We ran online hard disk checks, which showed that everything is normal. Here are the test results:

**************
RESULT SUMMARY
**************
Test Start time: Tue May 19 07:24:53 2009
Test Stop time: Tue May 19 07:29:54 2009
Test Duration: 000h 05m 01s

Test Name                 Cycles  Operations     Result  Errors  Last Error
CPU - Maths               540     328 Billion    PASS    0       No errors
Memory (RAM)              1       378 Million    PASS    0       No errors
Disk: Startup Disk [/dev  11      2.588 Billion  PASS    0       No errors
Disk: Hard Disk (/tmp) [  123     2.736 Billion  PASS    0       No errors
Disk: Hard Disk (/home)   0       2.919 Billion  PASS    0       No errors
Disk: Hard Disk (/backup  1       8.696 Billion  PASS    0       No errors
Disk: Hard Disk (/usr) [  17      3.912 Billion  PASS    0       No errors
Disk: Hard Disk (/var) [  17      3.746 Billion  PASS    0       No errors
Disk: Hard Disk (/boot)   562     1.235 Billion  PASS    0       No errors
Network: 127.0.0.1        91743   770 Million    PASS    0       No errors
TEST RUN PASSED

*********************
SERIOUS ERROR SUMMARY
*********************

=============

The data center (DC) has now suggested running an offline hard disk check, which they have already started, and we are waiting for an update from them.

Hopefully this gets sorted as soon as possible.

I would like to apologize to all clients on Hummer for the sudden downtime required for these hardware tests. Please be assured that we are working on this server and will get it fixed as soon as possible.

Again, thank you for your patience.

todd
05-20-2009, 06:23 AM
The server is up and working, but the read-only file system error is still there.

The previous offline test results were normal:

root@hummer [/home]# smartctl -l selftest /dev/sda
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error  00%        8649             -
++++
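
If you want to check a self-test log like the one above from a script, the following is a rough Python sketch. It assumes each log entry starts with "#", ends with the remaining percentage, the lifetime hours and the LBA column, and that a passed test contains the phrase "Completed without error". The helper name parse_selftest_log is hypothetical, and the sample text is simply the entry quoted in this post.

```python
import re

# The self-test entry quoted in the post, used here as sample input.
SAMPLE_LOG = """\
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 8649 -
"""

# Entry number, description plus status text, remaining %, lifetime hours, first-error LBA.
ENTRY_RE = re.compile(r"^#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def parse_selftest_log(text):
    """Parse self-test entries into dictionaries (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if not match:
            continue  # header lines and anything else are ignored
        num, description, remaining, hours, lba = match.groups()
        entries.append({
            "num": int(num),
            "description": description,
            "remaining_percent": int(remaining),
            "lifetime_hours": int(hours),
            "first_error_lba": None if lba == "-" else lba,
            "passed": "Completed without error" in description,
        })
    return entries


for entry in parse_selftest_log(SAMPLE_LOG):
    print(entry["num"], entry["passed"], entry["lifetime_hours"])
```

Note that a clean SMART self-test only says the drive did not report a problem. It does not check the filesystem itself.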

========

The data center wants to perform a forced filesystem check on /home, then remove and recreate the journal, to see if this solves the issue.

This will mean another 2 hours of downtime. We will have to ask them to go ahead with this so that the issue is resolved completely.

I will keep you all posted on the results.

todd
05-20-2009, 08:23 AM
The server is now booted into the Rescue Layer, and in the process of running a fsck on /home. Afterward we will re-build the journal, which should take care of the issue with the file system on this server randomly going read-only.
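
For the curious, the general shape of a forced check followed by a journal rebuild can be sketched as below. This is only an illustration and not the data center's actual procedure, which is not shown in this thread. It assumes /home is an ext3 filesystem handled with the standard e2fsprogs tools (e2fsck and tune2fs) and that the filesystem is unmounted, as it would be from a rescue environment. The device name /dev/sdXN is a placeholder, and the function only builds the command list without running anything.

```python
def journal_rebuild_commands(device):
    """Build (but do not run) the commands for a forced check and journal rebuild.

    Assumes an unmounted ext3 filesystem on `device` (hypothetical placeholder).
    """
    return [
        ["e2fsck", "-f", device],                   # forced filesystem check
        ["tune2fs", "-O", "^has_journal", device],  # remove the existing journal
        ["e2fsck", "-f", device],                   # re-check after removing the journal
        ["tune2fs", "-j", device],                  # create a fresh journal
    ]


# Print the commands for review instead of executing them.
for command in journal_rebuild_commands("/dev/sdXN"):
    print(" ".join(command))
```

Printing the commands first keeps the sketch safe to run anywhere. Actually executing them needs root access and the real device name.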

todd
05-20-2009, 09:29 AM
The fsck is done and the server is online now. We are closely monitoring it to see whether /home goes into read-only mode again.

We will keep updating this thread with more details.