Monday, 10 January 2011

Server in the twilight zone

One of my customers has an HP ProLiant ML150 G3 server running SBS 2003. I upgraded the memory and disk space in July in preparation for an upgrade to their main software application, which required a bit more oomph. After the upgrade and sorting out a few little problems, all was well. Come December, the server started tripping out in the night. There was nothing in the system logs; it just reset itself, which indicated a hardware problem. The normal rule is: if something has changed, that is most likely what is causing the problem. So I checked the new hard drives, which were in a RAID 1 (mirrored) array, and the volumes were not degraded. I swapped the new memory back for the old memory and the problem still persisted. In fact it was getting worse: the server tripped out more often and was liable to lock up during the day.

Winter was getting worse and the server room was getting really cold. There is no heating, and an extractor fan to the outside lets the cold in. The server has an operating temperature range of 10 to 35 degrees centigrade. We had problems in the summer, when it tripped out due to overheating, so now my thoughts turned to it being too cold. I used a program called SpeedFan to monitor temperatures, and one of the processor cores was indeed registering 7 degrees. Suspecting the cold or a faulty sensor, I kept the server room warm and disabled thermal monitoring in the BIOS. This seemed to have an effect but did not cure it.
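The check I was doing in my head comes down to comparing a reading against the quoted 10 to 35 degree range. Here is a minimal sketch in Python, assuming the readings are typed in by hand. It does not talk to SpeedFan, and the function name is made up for illustration.

```python
# Operating range quoted above for the server, in degrees centigrade.
MIN_OPERATING_C = 10
MAX_OPERATING_C = 35


def classify_reading(temp_c, low=MIN_OPERATING_C, high=MAX_OPERATING_C):
    """Say whether a temperature reading is below, inside or above the range."""
    if temp_c < low:
        return "too cold"
    if temp_c > high:
        return "too hot"
    return "in range"


if __name__ == "__main__":
    # 7 is the core reading mentioned above; the other value is a made-up example.
    for reading in (7, 22):
        print(reading, classify_reading(reading))
```

A reading outside the range does not say whether the room or the sensor is at fault, which is why I tried both warming the room and turning off thermal monitoring.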

Now it gets really weird. Many years ago I created a simple backup protocol based on external disk drives connected by USB. I have successfully used a program called MirrorFolder to create a real-time backup to these external drives, so that at the end of the day a drive can be disconnected and taken outside the building, giving the customer an absolutely safe copy of the work for that day. The problem now was that the server would sometimes recognise an attached drive and sometimes not. A restart would get it back. Next the keyboard would stop working, and that is not on USB but on the old small 6-pin mini-DIN (PS/2) connector. I suspected the backup drive, so I copied data to and from it, and the copies consistently failed.
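The copy test I did by hand can be repeated with a small script. This is only a sketch in Python, not what I actually ran: it writes random files, copies them to the drive and compares checksums. The function names are made up, and the drive path is whatever folder the external disk is mounted on. Bear in mind that reading a file back straight after writing it may be served from the operating system's cache, so a pass is weaker evidence than a failure.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_ok(source_file, drive_dir):
    """Copy one file to drive_dir and check the copy matches the original."""
    source_file = Path(source_file)
    target = Path(drive_dir) / source_file.name
    try:
        shutil.copyfile(source_file, target)
        return sha256_of(source_file) == sha256_of(target)
    except OSError:
        # A drive that has dropped off the bus usually shows up as an I/O error.
        return False


def run_trials(drive_dir, trials=5, size_bytes=1024 * 1024):
    """Copy several random files to drive_dir and return how many failed."""
    failures = 0
    with tempfile.TemporaryDirectory() as scratch:
        for i in range(trials):
            sample = Path(scratch) / f"sample_{i}.bin"
            sample.write_bytes(os.urandom(size_bytes))
            if not round_trip_ok(sample, drive_dir):
                failures += 1
    return failures
```

Calling run_trials with the backup drive's folder returns the number of failed copies. Anything other than zero on a drive that works fine on another machine points at the server rather than the drive.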

A solution was now at hand. Disconnect the backup and the server would not trip out or lock up. The problem here was that the customer would no longer have a backup, which was not acceptable. A temporary solution is to have the backup on at night and take it off during the day. I have remote access and deal with the overnight restarts every morning to make sure they have a working system for the day. The proper solution is to get a new server, and that is being instigated.

So what has been going wrong? I have had machines with USB problems before. Usually a device is not recognised, and I usually suspect power issues: the PSU cannot supply a stable current to the device. The fix is to replace the PSU or the motherboard. With a server that can get expensive, and their server is over three years old, so it's time for a new one.
