How much free space do you leave on your RAID arrays?

HCHTech

I ran into a problem of my own making yesterday: the App server VM on a customer's Hyper-V host crashed (state = "pause-critical") because the disk ran out of space. It was a 24TB array with two vhdx's, a 20TB and a 1TB. Something happened with the backup (I suspect) and a checkpoint was created, but it didn't merge back into the vhdx when the backup finished. As a result, the differencing file grew until the entire array ran out of space. This all happened in the space of 24 hours, and by the time my disk-space warning (set at 500GB) kicked off, the machine was already down.
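In hindsight, even a dumb script watching the storage folder for leftover differencing files would have flagged this well before the array filled. A minimal sketch (the path is just an example, and you'd wire the output into whatever alerting you use):

```python
# Rough sketch: flag any differencing (.avhdx) files sitting in the VM storage
# folder. An unmerged checkpoint shows up as a growing .avhdx next to its
# parent .vhdx. The path below is just an example; adjust per host.
from datetime import datetime
from pathlib import Path

STORAGE = Path(r"D:\Hyper-V\Virtual Hard Disks")  # example path

for avhdx in STORAGE.rglob("*.avhdx"):
    stat = avhdx.stat()
    size_gb = stat.st_size / 1024**3
    modified = datetime.fromtimestamp(stat.st_mtime)
    print(f"WARNING: differencing disk {avhdx.name} is {size_gb:,.1f} GB "
          f"(last written {modified:%Y-%m-%d %H:%M})")
```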

Luckily, deleting the checkpoint allowed a successful merge, but now I'm stuck with the job of shrinking that fixed vhdx. They are only using 11TB of the 20TB, so I don't think the world will stop if I shrink it to 16 or 18TB. Also, I'm going to revisit the other customers where I'm maintaining servers and check what I did there.
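For the shrink itself, my understanding (sketch only, I haven't run this yet) is: shrink the volume inside the guest first so the tail of the disk is unallocated, shut the VM down, then run Resize-VHD on the host. Roughly, with example paths and sizes:

```python
# Sketch only: shrink a fixed VHDX *after* the volume inside the guest has
# already been shrunk and the VM is powered off. Resize-VHD is the stock
# Hyper-V cmdlet; the path and target size below are examples.
import subprocess

VHDX_PATH = r"E:\VMs\APP01\Data.vhdx"   # example path
NEW_SIZE_BYTES = 16 * 1024**4           # ~16TB example target

subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     f"Resize-VHD -Path '{VHDX_PATH}' -SizeBytes {NEW_SIZE_BYTES}"],
    check=True,
)
```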

I probably should have used dynamic vhdx's, or made the fixed size smaller to leave more room for this kind of thing to happen without such an immediate consequence. This brings up the overall configuration question, though: what is best practice for free space on an array when doing the initial configuration? What are you doing in practice?
 
Hypervisor-level backups do their snapshotting on the host's storage.

What does this mean? It means you leave 50% of the hypervisor's space free OR you risk all of the above. Just like you need 2 to 2.5x the backed-up storage on your backup appliances that can spin up a VM of whatever they backed up.

Your only other choice is to use VM level backups, because that moves the snapshotting load INTO the VM.

Also, never, ever... and I mean NEVER use thick provisioning. Do you want to wait for a 20TB image to restore after you delete the old one to get that VM online?

Or do you want enough space to have two copies of that VM on the same host to work through an issue? It's all the same problem.
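To put rough numbers on it with the array from the first post (just arithmetic, not a sizing recommendation for that particular box):

```python
# Quick headroom check against the "keep ~50% free so a host-level snapshot or
# a second copy of the VM can land" rule above. Numbers are the 24TB array and
# the two fixed disks from the first post.
ARRAY_TB = 24
GUEST_VHDX_TB = [20, 1]   # fixed-size disks provisioned on that array

provisioned = sum(GUEST_VHDX_TB)
free = ARRAY_TB - provisioned
print(f"Provisioned: {provisioned} of {ARRAY_TB} TB ({provisioned / ARRAY_TB:.0%})")
print(f"Free headroom: {free} TB ({free / ARRAY_TB:.0%})")

if provisioned > ARRAY_TB * 0.5:
    print("Over the ~50% mark: a stuck checkpoint or second copy won't fit.")
```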
 
Also, never, ever... and I mean NEVER use thick provisioning. Do you want to wait for a 20TB image to restore after you delete the old one to get that VM online?

Well, I can tell you two things. Firstly: no one taught me this; I just read the books (for example, "Mastering Windows Server 2016 Hyper-V") and asked occasional questions here. This means I get to run into both the upsides and downsides of my various decisions on my own, in practice.

Secondly: when I made the decision to use fixed-size vhdx files vs. dynamic, I read the pros and cons of each and chose accordingly. Part of that was keeping the OS vhdx separate from the data vhdx. On this server, for example, the 180GB OS vhdx is on a completely separate array from the 20TB data vhdx. This (in my opinion at the time, and arguably still) minimized the potential downside of restoring an entire VM if it went down. One of the "pros" of fixed size is avoiding the issue of reclaiming space used by deleted files, for example. Another is avoiding the processing overhead of dynamically growing the vhdx. Those may be outweighed by the "cons" of fixed-size vhdx's, as you have noted, but in my opinion at the time, I didn't think so and didn't go that way.

This server has run rock-solid for 2 years now, so I think I just have to live with this as a potential issue and focus on getting notified when it is in play, as well as recovering what free space I can by manually shrinking that fixed-size vhdx by 4TB or so.

Also, I have always made sure that the OS array has enough space for 2 copies of any machines on it, for exactly the reasons you noted. I just didn't carry that thinking over to the data arrays, mostly due to cost (more drives + more drive bays needed, etc., maybe even a bigger/additional RAID card - all the dominoes that fall with greater needed capacity).
 
I like to spec a new HV host at >2 to 2.5x the total sum of the guests.

But I do backups via Datto within the guests, never HV backups, so I rarely deal with checkpoint file bloat (except back in the early days of Server 2008 non-R2, which had the known bug).

Easy enough to expand an HV host if you use good hardware with more drive capacity...you can slam in more drives down the road if needed.

Thick vs thin/dynamic, depends on the job of the guest. Some I use dynamic, some I still prefer thick. Easy enough to resize if needed.
 
Dynamically allocated disks should be used for all workloads EXCEPT heavy ones. And by that I mean an actually busy SQL server. Most of us simply don't live in the IOPS land that needs thick provisioning.

But everyone that works with hypervisors learns this lesson the hard way at some point... Everyone I know has a server just like yours in their past; it's just a thing...

All you need to do to really adjust for it is make sure you're using an agent on the GUEST to back up that monster VM instead of the host. That way any snapshotting is done inside the disk, instead of outside.
 
As a rough rule of thumb, I'd say I start getting worried if free space on the host drops much below 20%. It depends on a lot of factors though, as different workloads produce different results. A bunch of web servers are going to have tiny checkpoints compared to SQL, Exchange, file servers, etc.

As for your case, you can convert the fixed disk to dynamic, but it's not going to be straightforward.

The first hurdle is that you are going to need at least 31TB of space to achieve it (20TB fixed + 11TB dynamic). The conversion process is basically a copy from fixed to dynamic, so you need space for both to coexist temporarily. It could be done with a USB3 external drive as temporary storage.

The second hurdle is that this needs to be done while the VM is powered down, and it could take a LONG time for 11TB of data to be written, especially if using a basic external HDD.
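If you do go that route, the whole thing is basically a pre-flight free-space check plus one cmdlet. A rough sketch (paths are examples, and the VM stays off for the duration):

```python
# Sketch of the fixed -> dynamic conversion described above: make sure the
# source and destination can coexist, then hand off to Convert-VHD.
# Paths are examples; the VM must be powered off for the whole run.
import shutil
import subprocess

SRC = r"E:\VMs\APP01\Data.vhdx"         # the 20TB fixed disk (example path)
DST = r"F:\Staging\Data-dynamic.vhdx"   # temporary destination (example path)
EST_DATA_BYTES = 11 * 1024**4           # roughly 11TB actually in use

if shutil.disk_usage("F:\\").free < EST_DATA_BYTES:
    raise SystemExit("Not enough room on the staging volume for the dynamic copy.")

subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command",
     f"Convert-VHD -Path '{SRC}' -DestinationPath '{DST}' -VHDType Dynamic"],
    check=True,
)
```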

Realistically, I'd say just set up better monitoring. I'm actually a little confused how you got alerted at 500GB free space yet it was already offline? Was the alert delayed, or were you too busy to respond for a few hours? Even with the latter, 500GB going into a differencing disk in a few hours means this must be a very, very busy server.
 
Realistically, I'd say just set up better monitoring. I'm actually a little confused how you got alerted at 500GB free space yet it was already offline? Was the alert delayed, or were you too busy to respond for a few hours? Even with the latter, 500GB going into a differencing disk in a few hours means this must be a very, very busy server.
Yep. It should read "By the time it alerted and I responded", but it really wasn't that long, 2-3 hours I think. I blame Revit for the quick gains; those files are huge. I had just barely remoted into the host to check it out when the phone rang saying they had lost access. The plan is still as I stated above: reduce the size of the fixed vhdx so there is more wiggle room, and then set up better monitoring. It'll be fine, and I won't make this mistake again.
 
Ah well, you can't blame anyone there. Sounds like the monitoring did its job, and while you took a few hours to respond, I can't say I'd have done any different. Seeing a ticket for under 500GB wouldn't be urgent under normal circumstances.

Maybe an improvement could be to set up different levels of alerts. For example, a notification when it goes under 750GB, a warning when under 500GB, then critical when under 250GB. Then you can prioritize and act faster if needed.
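Something like this is all the script needs to be; the thresholds are the ones above, the drive letter is an example, and the print() would be whatever your RMM or email alerting wants:

```python
# Tiered free-space check along the lines suggested above: notice under 750GB,
# warning under 500GB, critical under 250GB. Drive letter is an example; swap
# the print() for whatever your RMM or email alerting expects.
import shutil

GB = 1024**3
TIERS = [(250 * GB, "CRITICAL"), (500 * GB, "WARNING"), (750 * GB, "NOTICE")]

free = shutil.disk_usage("E:\\").free
for threshold, level in TIERS:
    if free < threshold:
        print(f"{level}: only {free / GB:,.0f} GB free on the data array")
        break
else:
    print(f"OK: {free / GB:,.0f} GB free")
```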
 