Odd graphics card issue - Solved but why?

HCHTech · Aug 20, 2018

We built two identical Autodesk workstations for an architect client of ours back in April. Both were delivered with fully patched Win10 Pro and latest BIOS & drivers.

Ryzen 7 1800X
32GB RAM
512GB Samsung 970 Pro NVMe drive
Antec full tower case
Antec 750W Gold PS
Nvidia Quadro K1200 graphics card
Dual 27" monitors

Client uses Autodesk Revit 2019 and the Adobe Cloud suite, plus the normal complement of Office & various other common software.

About a month ago, one workstation started crashing randomly. No BSOD or other evidence in the logs, just the maddening "the previous shutdown was unexpected."

This sounded thermal to me, so when I was onsite, I made sure all of the fans were working (2x case fans, cpu and graphics card fans, PS fan). I also changed the fan curve in the BIOS to run faster sooner based on heat. I also took a look for anything obvious that we might have botched with the assembly. Didn't find anything. I also loaded the latest Nvidia driver.

It seemed better for a week, but was still crashing occasionally. Same lack of evidence in the logs.

So...I was onsite this past weekend for server maintenance, and spent more time with the problem. To my dismay, I discovered that we had mounted the PS upside down. The case has an input with a filter on the bottom, and the PS fan side draws inwards, so it is clearly meant to use outside air to cool the PS. The identical workstation without the problem had the PS mounted correctly, btw.

I re-mounted the PS in the correct orientation, even though I didn't think it was related. By having the PS draw air from inside the case and exhaust it out the back, the incorrect mounting should have acted to draw MORE warm air from inside. The effect of the incorrect mounting would have been that the PS would run hotter, not the computer. Seems to me anyway. I booted up Revit & queued a render, which proceeded until 99% done, then the computer locked up. No errors in the logs.

I also ran FurMark for a while. This shot the temperature of the graphics card pretty quickly to 85 degrees Celsius, but it leveled out there. I was a little uncomfortable with it at this temperature, so I stopped the run after several minutes. The machine didn't lock up, though.

Next, I decided to swap the cards between the two identical workstations to see if the problem followed the card. To my surprise, it did not. Both workstations ran their normal software load without crashing for about an hour. Next, I started FurMark on both workstations and let it cook. Both cards shot up to 85/87 degrees, but leveled off there, and ran without complaint for about 30 minutes. OK, so I started Revit on both workstations and queued the identical render. (The files were first copied locally to each machine so it wouldn't be pulling them over the network.) Both computers ran for the next 20 minutes simultaneously running a full-screen FurMark on one monitor and rendering a drawing on the second monitor. Neither computer complained a bit and both finished the render successfully.

So....apparently swapping the cards fixed the problem, but I don't know how. They were clearly NOT mounted incorrectly initially. Both were secure in the same slot on both motherboards.

I swapped the cards back and took the problem workstation back to the shop where I could do full hardware diagnostics and try to break it over the weekend. It survived a 12-hour FurMark run and a 12-hour memory test, and showed no errors at all in a full PC Doctor run set to 50 iterations.

I took it back this morning and installed it, so time will tell, but this was an odd one, that's for sure. I wish I had dismounted and remounted the card in the same workstation, that would have put a finer point on the diagnostics had it fixed the problem, but I didn't.

ComputerRepairTech · Aug 20, 2018

HCHTech said:
About a month ago, one workstation started crashing randomly. No BSOD or other evidence in the logs, just the maddening "the previous shutdown was unexpected."

I dislike the word "crashing" unless it's pertaining to a BSOD. Do you mean it instantly rebooted, as in you pretty much immediately saw the splash screen or video card text? Did it power down and not come back up? Did it turn off and then come back on a few seconds later?

What brand of Nvidia Quadro K1200?

HCHTech · Aug 20, 2018

My bad, I know better than that -> Crashing = hard freeze. Frozen image on screen, no mouse or keyboard control. Have to force power down by holding in the power button.

Cards are PNY.

GTP · Aug 21, 2018

So the problem is not "solved" but just hasn't manifested yet (or again)?
I had a similar situation with a gamer. 2 identical systems (one his, one for his GF).
The GF's kept freezing up after about an hour's gaming.
Swapped the PSUs for new ones, swapped RAM between units, ran tests etc.
Didn't pull the cards till much later, and on close inspection noticed that the one freezing up was not quite seated correctly.
When it was screwed down, it pulled the far end of the card slightly out of the PCIe socket.
The case must have been slightly askew, so I applied a good amount of force to the case and slightly bent it back.
Never had a problem since.

HCHTech · Sep 29, 2018

Well, this problem is back with a vengeance. The machine is hard-freezing again, so we've got it on the bench this weekend to figure it out once and for all.

Components:

Asus X370A Motherboard (latest bios)
2 16GB Sticks of Kingston DDR4-2400 RAM
Ryzen 7 1800X Processor
Antec Gold 750W power supply
Samsung 970 Pro 512GB NVMe SSD
PNY Quadro K1200 Video Card (latest drivers)
Antec P100 case - this is a full tower, there is a ton of empty space with good airflow - nothing is overheating
Win10 Pro 64 (1803 fully patched)

After backing up the data, we re-ran our hardware diagnostics, memory tests (only an hour or so, we're trying to fix this by Monday), hard disk test, video burn-in, etc. Everything passes. We tried a fresh install of Windows; it freezes partway through. We've swapped out the video card, same answer. We moved the video card to one of the other available PCIe slots, same answer. We moved the RAM to different slots, same answer. We removed the NVMe drive and put in a fresh SATA SSD, no difference - it froze partway through the Windows install. We measured the temperature of the CPU & northbridge chips, nothing exciting there: 35 degrees at the CPU and 50 degrees at the northbridge. The power measurements in the BIOS look spot on. The CPU temp in the BIOS matches our external measurement within a degree or two.

It also freezes when we boot to Linux, so it has got to be a hardware problem. We swapped out the motherboard for a new one (identical model), same answer. We removed the RAM and put in a single new stick, same answer. Since this is a 1st generation Ryzen, we can't try the integrated video.

Oh, and we looked VERY carefully at the graphics card and how it was seating - no case problems here, everything about the connection looks perfect.

Things we haven't done yet:
- Tried to assemble the system outside of the case - I don't think this is the problem, case shorts usually present as non-boot.
- Swapped out the power supply (we're in the middle of doing this now)
- Swapped out the CPU

I've got a Ryzen 7 2700 here in stock, but it's already "assigned" to a new workstation for this same client that they need for a new employee next Thursday, and we haven't put it together yet. I guess I'll try installing that if the new power supply doesn't help.

Just as an aside, we built two of these at the same time in April for this client, and the other one is working fine.

Edit: A new power supply (Antec Gold 650W) didn't change the problem in the end. We were able to install Windows, but after several minutes of running, it froze up again. Maybe fresh eyes will be beneficial tomorrow, but it looks like I'm swapping the CPU next. For now, we're going to boot up PC Doctor and set it for 50 iterations of the CPU tests.

HCHTech · Oct 1, 2018

Well, it was the CPU. I swapped in the Ryzen 7 2700 and all is well. I put all of the other original equipment back, reinstalled Windows, Office, network printers, re-joined the domain, restored the data and delivered it first thing this morning to the client. I couldn't make the CPU fail any of the tests I threw at it - the only symptom was a hard freeze, whether it was in Windows, Linux, or booted to the diagnostics suite (also Linux). My diagnostic procedure was very inefficient for this one, unfortunately. I narrowed it down to the CPU by basically replacing every other part. I think I need to clean my crystal ball, not very proud of this one.
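
For anyone following along, the elimination procedure described in this thread can be sketched roughly as a one-part-at-a-time swap loop. This is only an illustration in Python, not a real diagnostic tool: the part names mirror the components listed earlier, while `find_bad_part`, `SWAP_ORDER` and `simulated_freezes` are hypothetical names made up for the example. It also assumes a single faulty part and a freeze that reproduces reliably, which is a simplification of what actually happened on the bench.

```python
from typing import Callable, Dict, List, Optional

# Order in which parts were swapped in this thread (CPU last).
SWAP_ORDER: List[str] = [
    "video card", "RAM", "SSD", "motherboard", "power supply", "CPU",
]


def find_bad_part(
    parts: List[str],
    freezes: Callable[[Dict[str, str]], bool],
) -> Optional[str]:
    """Swap in one known-good part at a time.

    Returns the first part whose replacement stops the freeze,
    or None if no single swap helps.
    """
    for part in parts:
        # Start from the all-original build, then replace just one part.
        build = {p: "original" for p in parts}
        build[part] = "known-good"
        if not freezes(build):
            return part
    return None


def simulated_freezes(build: Dict[str, str]) -> bool:
    # Assumption for illustration only: the original CPU is the faulty part,
    # so the machine freezes whenever it is still installed.
    return build["CPU"] == "original"


if __name__ == "__main__":
    print(find_bad_part(SWAP_ORDER, simulated_freezes))
```

In practice `freezes` would be a person reassembling the machine and waiting to see whether it locks up, so the cost of each iteration (and how likely each part is to fail) is what really decides the order, which is why the CPU usually ends up last.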

NJW · Oct 2, 2018

HCHTech said:
I think I need to clean my crystal ball, not very proud of this one.

Don't beat yourself up over this. As @sapphirescales mentions in another thread, CPUs are so reliable these days that they're always the 'last resort' suspect. You may never see another CPU failure and I don't think we need to start swapping CPUs as a primary troubleshooting test.

ComputerRepairTech · Oct 3, 2018

HCHTech said:
I narrowed it down to the CPU by basically replacing every other part. I think I need to clean my crystal ball, not very proud of this one.

I wonder how many of us looked at the processor and thought to ourselves... maybe... it is AMD after all... nah, it's almost never the CPU.

GTP · Oct 4, 2018

In almost 20 years I've only had one processor fail. And I still think that was my fault! (unconfirmed).
It was a 2.6GHz P4. I think I screwed up with the heatsink but I'm not sure.
Intel replaced it no questions asked, so if it had been "cooked" they would have rejected it.

CPU failure is so rare that you never suspect it.

DRPCNZ · Oct 4, 2018

NJW said:
Don't beat yourself up over this. As @sapphirescales mentions in another thread, CPUs are so reliable these days that they're always the 'last resort' suspect. You may never see another CPU failure and I don't think we need to start swapping CPUs as a primary troubleshooting test.

I concur; the days of having to swap out CPUs to test are long gone. One of my favorite sayings is "Computers have no higher purpose than making Users & Techs feel STUPID"
