Nailing down the cause of a hard freeze

HCHTech

I have an SBS2011 server, new in 2012, so not slated for replacement quite yet. It's a small domain, I think 8 users.

Anyway, every couple of weeks (but NOT on a predictable schedule) the thing locks up hard in the wee hours of the morning, always between 2:00 and 3:30. It's on my monitoring system so I know the instant it stops communicating. Besides being in this time range, it's not predictable. When I get to the server the next morning, there is a black screen and it doesn't respond to mouse clicks or keyboard input. Nothing to do but a hard reset, which sets things right again.

There is nothing in the logs at all that gives a clue to the cause of the freeze. It is sitting on top of a desk in an empty office and is well ventilated. It's on a good battery backup and there are no entries in the UPS log about voltage drops or spikes. I have searched and cannot find any scheduled tasks that might be running at that time. There is a daily onsite backup that starts at 10pm and finishes by 11 or so. There is a daily cloud backup that starts at 11:30 and finishes by 1. Nothing in either log gives any hint of problems.

I thought it might be updates, so I shut off Windows updates altogether as a test last month - it still froze again last night.

I'm at a loss on how to get this thing to give me some clues to follow. Little help?
 
Agreed on the hardware, sorry about that - It's an AMD Opteron with 16GB of RAM, and 3 SAS Seagate 10K's in a RAID5 on an Adaptec card. They are using Exchange, but that's about the hardest work the server does. This customer was new this past year for me, so I don't have much history on the machine. No troubles until a couple of months ago, though. No new software installed, and fully patched until I shut down the updates. The RAID utility reports the disks are healthy and the logs don't point to anything along those lines.

In fact, it has run just fine when everyone is working and using it the hardest. It freezes at a time when I would expect it is mostly idle.

The logs are similar, but I don't think identical, before the freezes. I can paste some samples if that would help, but the entries are mostly all benign. No critical errors preceding the freeze, just the expected "The last shutdown was unexpected" when it starts back up again.
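One way to test the "similar but not identical" impression is to line up what was logged shortly before each freeze and keep only what shows up every time. A minimal sketch in Python, assuming the events have already been exported into (timestamp, source, event ID) tuples and the freeze times come from the monitoring system. The function name, the 30-minute window, and the sample data are all made up for illustration:

```python
from datetime import datetime, timedelta

def events_before_every_freeze(events, freeze_times, window_minutes=30):
    """Return the (source, event_id) pairs logged in the window before EVERY freeze.

    events: list of (timestamp, source, event_id) tuples (hypothetical export format)
    freeze_times: list of datetimes at which monitoring lost contact with the server
    """
    window = timedelta(minutes=window_minutes)
    common = None
    for freeze in freeze_times:
        # Everything logged in the window leading up to this particular freeze.
        seen = {(src, eid) for ts, src, eid in events
                if freeze - window <= ts <= freeze}
        # Keep only what has also preceded all earlier freezes.
        common = seen if common is None else common & seen
    return common or set()

# Made-up sample data, only to show the shape of the input.
sample_events = [
    (datetime(2017, 1, 3, 2, 5), "ExampleSourceA", 100),
    (datetime(2017, 1, 3, 2, 20), "ExampleSourceB", 200),
    (datetime(2017, 1, 17, 2, 40), "ExampleSourceA", 100),
    (datetime(2017, 1, 17, 1, 0), "ExampleSourceB", 200),  # outside the window
]
sample_freezes = [datetime(2017, 1, 3, 2, 30), datetime(2017, 1, 17, 3, 0)]

print(events_before_every_freeze(sample_events, sample_freezes))
```

An empty result would suggest that nothing the OS logs is a consistent precursor, which would point back toward hardware or firmware rather than a scheduled job.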
 
It is running Exchange. What is happening to the mail during the "down" time? Are you 100% certain that it is locked up? Network shares drop and so on?
 
Power Supply is also where I would look next. CAPS and so on. An AMD Opteron seems a poor choice for an email server. **** like this is why you move them to O365.
 
Sounds like a cloner...those are hard servers to troubleshoot.
NIC drivers up to date?
Run the SBS Best Practice Analyzer on it?

I know the W3 services can run away, combined with Exchange services. On some clients' older servers (typically heavily used ones..I know this one isn't) I've had to schedule an IISRESET followed by bouncing all Exchange services. The server would slowly fall asleep..the first sign of things going was that the remote portal would not allow users to browse the company share through the web based file explorer.
 
Power Supply is also where I would look next. CAPS and so on. An AMD Opteron seems a poor choice for an email server. **** like this is why you move them to O365.

Thanks - I think I'll get them to let me have it for a weekend. I was hoping to get something to point to a culprit because of the unusual timing, but you don't always get what you want, haha.

w/r/t the choice of chip, I didn't choose it - but have already discussed O365 for when we do replace the server next fall (if it makes it that far!). They are a prime candidate, that's for sure.
 
Next fall...heh....that's a lot of finger crossing that a server will boot up after another 20 or so "hard/rude reboots". Probably not battery backed cache on that controller like HP or Dell RAID controllers...those give you more confidence that nothing gets corrupted when rude reboots happen. (When I say rude reboot...I mean hard reboots, pulling the power cord, holding in the power button, etc.)

You have a Small Business Server that runs a loooooot of stuff: AD, tons of SQL instances, Exchange.
Might consider moving in new server hardware sooner than the migration...to a P2V...run SBS as a guest on that new server, and then do the migration next fall to Essentials and O365.
 
You will need to have a broader look at the logs given the nature of the failure. An OS crash takes more than a few bad bits these days. But the timing thing is a big indicator to me. What about other machines on the network? Are they left running overnight? Any problems? Another thing to look at: are there any maintenance or similar activities scheduled in that time frame? I'd also swap out the UPS with a known good one, preferably one you have been using for a while.

Given this is a gremlin, a problem you can't put your hands on, it's going to be tough going. If that RAID card does not have a battery, as pointed out, I'd be pushing for a change. Is the backup bulletproof? Meaning, have you tested it recently?
 
Just for kicks, I spent a little time googling "Opteron vs. Xeon". I found this quote for "Advantages of Xeon processors":

-------
- Nobody ever got in trouble for buying Intel. Buying from the 800 lb gorilla of the market allows the IT peon to shift the blame from themselves if anything goes wrong. Buying from somebody other than the 800 lb gorilla in the field invites personal criticism and blame if anything goes wrong even if it saves the CFO a bundle.
-------

I do think in general, there is some unwarranted bias against AMD in the server market. It's easy to look down our noses at home-built servers like this, but hear me out. The previous owner of the company I acquired built these regularly for small businesses like the one this thread is about. I understand all of the arguments against home-built servers like this one, but it's not hard to see the financial reason they make sense for the small end of the market. Sure, you give up support (or more accurately, you take on the liability for all support), but with small businesses, you ARE the support anyway. If something goes south, YOU are the one they will call in any event.

When I bought this company at the end of 2015, I took on about 25 of these types of servers of various age (with some trepidation, I might add). I have a spare server sitting on the shelf in case I need an emergency replacement for one of them, and it has sat there all year, unneeded. In fact, this particular client is probably the only one that has had a problem that made me consider dusting off that spare. I probably shouldn't say that out loud, I'll need two of them next week - haha.

I won't go so far as to say I've had more problems with the more-accepted Dell or HP servers of the bigger businesses I handle, though. They may have had more problems, but that is more due to their usage. More users, more data, fancier software, more risk of problems...regardless of hardware.

I don't know why I'm rambling on about this - I guess it's just that my experience with small servers like this over the last year has surprised me - it's been pretty positive, all in all. In the low end of the market like this, you can put one of these in for significantly less than a Dell or HP solution. It kinda makes sense, and I understand why folks would do it.
 
It's not so much against AMD themselves....CPU versus CPU. And, in my middle years of building custom gaming machines, I churned out a LOT of AMD based gaming rigs based on the XP Palomino CPUs like the XP1200 - XP1400 days (when Intel's first generation Pentium 4's sucked)...so I'm not biased against AMD processors. There was finally a good motherboard chipset I liked, nForce, so I was loving those days.

It's a couple of factors.

*Everything else on those systems is typically cheaper. Let's look at those $399 turd laptops from chain stores. Total disposable painfully slow crap. The cheapest of wireless NICs, the ultra slow low rpm short life span hard drives, cheaper graphics, cheaper motherboard. Not even worth 1/2 hour of time trying to troubleshoot.

*Home built systems are often just a grab of "motherboard of the month club" parts, without rigorous, thorough testing of compatibility between parts. It's thrown together by someone who "thinks" they should work well together. Hardly any cloner builders thoroughly test parts working together over long periods of time. I'll give a quick example. Years ago the company I worked for back then was trying to get this big law firm as a client. They had a home built server (and workstations) from this large local IT company that built their own. They had a server that would hard lock once in a while, similar to yours. We had planned a server upgrade towards the end of the year; this client ran off of Microsoft Back Office Server (basically the NT 4 Server suite that evolved into Small Business Server 2000). I spent a long time troubleshooting that, no luck. We bumped up the SBS2000 install...and shortly after that I finally found the issue behind their NT 4 server lockups. I stumbled across probably one of the 10 other people in the world that had the same hardware combination on some forum, and there was an issue between the RAID controller and the particular motherboard used in this server (I think it was Soyo or maybe SIS).

And BTW, my above statement can happen with Intel systems too. So it's not AMD that I'm against, it's home grown servers when used in a business. And I've built quite a few servers...and I mean real servers..not glorified desktops. But I never used them for a client, only for myself, our office, and back when I was big into online gaming and I built/supported/ran some popular public gaming servers which I co-located at an ISP's data center.

Support from the client's perspective is another thing. You put together some server, install it at a client, you get hit by a bus, 95% of the other techs around don't want to support your Frankenstein of a server. How's that fair to the client? Part of this profession is not only doing what's good for you and your business, but factoring in what is best for your client also. Getting too niche can really screw the client if you're gone, they're stuck with something nobody else wants to touch. You're also putting more pressure on yourself: while you're still alive, client calls, issue with server, you and only you are going there to fix it, burden is on you. With a tier-1 server, heck I can have a long distance client, with basic troubleshooting or just gathering info remotely via iLO or iDRAC...I spend a few minutes on chat support or on phone support, and a tech is dispatched onsite by Dell or HP. They do all the work and travel.

It's not that home built computers "can't" run well, it's that...most of the time, they're put together by an amateur (or insert the phrase "pizza tech") that doesn't really know what they're doing. I have over 20 years supporting SMBs where I've seen a tremendous amount of servers in my career to base my NSHO on. A good portion of the first half of my computer career was working for 2x locations of a large franchise called Computer Land....and they had a HUGE amount of business clients, so I saw "a lot".
 
Well I don't have server experience, but if I got numerous examples of it happening, always between 2am and 3:30am, and it was a normal Windows machine, I suppose I would first check if drives are set to go to sleep on inactivity. I may also connect remotely and record the remote session and have process monitor open to see what's running before the freeze, if anything.
 
Thanks for all of the suggestions - and I totally get the argument YeOlde stated - and I agree given his suppositions. In my particular case, the previous owner of the company I acquired had a 30-year business supporting only commercial clients. He was definitely old school and a little stubborn. He would use the standard vendor solutions for his "larger" clients, but for the small ones like this he liked to build his own. Almost every one has the very same motherboard, same brand of memory, same brand of power supply, etc. I don't have an inventory of the particular processors, but all of them are AMDs. I have no trouble at all imagining that he found parts that would work well together and he could support (it was just him as the only tech and an administrative person). I wouldn't ever assign him the label pizza tech. I'm sure he felt the cost/benefit ratio was just so much better and these little clients would almost certainly agree.

On the other end of the spectrum, one of the larger clients is a 35-employee, 2-location chiropractor. Vendor solution all the way. 2 big Dell PowerEdge towers each with dual hot-swap power supplies (he wasn't into virtualization), and a smaller server running AppAssure backup syncing to an identical unit in the owner's home. Vendor support agreements with same day SLA - the works.

He used SonicWalls exclusively, for example. Over time, he learned the firmware inside and out and how to make them do anything but paint your house, I believe. He recently talked me through a sticky port-translation setup while he was on a boat fishing in Florida - haha.

Anyway, enough about that - I just wanted to underline that building the little servers was a conscious decision by someone who understood the risks and benefits. His clients all loved him, so I'm not about to argue the point. And, as stated before, I can definitely understand. This practice has also given me a challenge. When it comes time to replace these boxes (I have 5 or 6 that are due this year at some point) I will have to decide whether to stay this course or go the more acceptable (and costly) route. It's going to be a fun year.
 
I may also connect remotely and record the remote session and have process monitor open to see what's running before the freeze

Ohh, I like this idea - I was just reading up on how I might get a performance monitor log over time.
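Along the same lines, a very low-tech option is a heartbeat script that appends a timestamp (and optionally a process list) to a file every minute and forces each entry to disk, so the last entry in the file shows roughly when the box stopped and what was running. A rough sketch in Python, assuming Python is installed on the server. The function names, file name, and interval are made up for illustration, and `tasklist` is just one possible snapshot command on Windows:

```python
import os
import subprocess
import time
from datetime import datetime

def write_heartbeat(path, snapshot_cmd=None):
    """Append one timestamped entry to path and force it to disk.

    snapshot_cmd: optional command (list of strings) whose output is recorded,
    e.g. ["tasklist"] on Windows to capture the running processes.
    """
    lines = [datetime.now().isoformat(timespec="seconds")]
    if snapshot_cmd:
        result = subprocess.run(snapshot_cmd, capture_output=True, text=True)
        lines.append(result.stdout.rstrip())
    with open(path, "a", encoding="utf-8") as log:
        log.write("\n".join(lines) + "\n---\n")
        log.flush()
        os.fsync(log.fileno())  # a hard freeze would otherwise lose buffered writes

def run(path, interval_seconds=60, snapshot_cmd=None, iterations=None):
    """Write a heartbeat every interval_seconds; iterations=None means run until killed."""
    count = 0
    while iterations is None or count < iterations:
        write_heartbeat(path, snapshot_cmd)
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)

# Example call (hypothetical file name), left commented out because it runs until killed:
# run("heartbeat.log", interval_seconds=60, snapshot_cmd=["tasklist"])
```

After the next freeze, the final entry in the file gives the last moment the OS was still able to write to disk, which can be compared against the time the monitoring system lost contact.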

I've arranged some downtime this weekend - I'll run some hardware diagnostics and plan on replacing the PS as nline suggested, and maybe swap out the UPS, I do have a spare unit here for emergencies. Luckily for me, this is a pretty sleepy little office - almost no one works weekends. :-)
 