The VM Shielding Repair Garage – Part 1

Monday, 15 August 2016

I’ll preface this blog by saying that if you’re here to learn the basics of VM Shielding then this probably isn’t for you. If you’re already familiar with the concepts of VM Shielding and Guarded Fabrics and want to learn more about how to recover a stricken Shielded VM, then read on!

One of the core concepts of a Shielded VM is that a fabric admin should not, cannot, and will not ever be able to gain access to a tenant VM for any reason. This is brilliant from a security perspective (and a unique feature in Hyper-V 2016 compared to all other hypervisors/public clouds), but when it comes to troubleshooting it can definitely raise a few eyebrows.

I often get asked why we can’t just temporarily un-shield a VM and then re-shield it after troubleshooting. The answer is that this would fundamentally break the trust model of VM Shielding: once un-shielded on the fabric, that VM could never again be trusted to be uncompromised, so no such feature or function is available.

So if a tenant borks the networking in their VM, or reboots it and it fails to come back up, or it crashes, or any of a whole plethora of other scenarios happens that break remote access over RDP, SSH and the like, then originally your only option would be to restore from backup.

Thanks to the advent of nested Hyper-V however, we have a new option available to us which empowers the tenant to repair a VM themselves, without ever compromising the trust model of it being Shielded.

Enter: The Repair Garage. All scripts referenced in this blog are available through this link, unless otherwise specifically noted.

The Repair Garage concept allows a tenant to bring a Shielded VM inside another Shielded VM which is also a nested and guarded Hyper-V host, un-shield it, console on to the stricken VM and repair it, re-shield it, and return it to the main fabric, all without it ever being exposed to the fabric admins at any time.


Ok, it’s a theory, but there isn’t exactly a plethora of Guarded Fabrics available in the world to test it on. Fortunately we have a production-ready and fully featured TP5 one at our disposal, from TPMv2 to WAP, so testing-ho!

For this testing we have set up a three-node Hyper-V cluster of Dell R630s, each host fitted with a TPMv2 chip, set up as a Guarded Fabric managed by VMM2016TP5, and actively able to run fully Shielded VMs.

Within this environment we set up a new Cloud for the purposes of testing, and enable it for VM Shielding, then deploy a VM that we presciently name ‘Stricken VM’.

As expected, I can RDP to the VM using my signed RDP file.

Once connected over RDP, I disable the NIC in order to ruin my access to it. At this point, there is no way to regain access to the VM through traditional means, be it Console, PowerShell Direct, or other.

As we see, I as a Fabric Admin cannot console on to the Stricken VM to repair it. Oh balls.

The first stage in recovering this VM is deploying a new Shielded VM to function as a nested Hyper-V host, or a ‘Repair Garage’ as Microsoft term it.

IMPORTANT: These VMs need to be connected to the same vSwitch and on the same Host.
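Both conditions can be checked from the host before going any further. The snippet below is a minimal sketch using the standard Hyper-V PowerShell module; the VM names 'Stricken VM' and 'Repair Garage' are assumptions, so substitute whatever your VMs are actually called.

```powershell
# Assumed VM names; replace with the names used in your fabric.
$vmNames = 'Stricken VM', 'Repair Garage'

# Run this on the host in question. Get-VM only returns VMs on the local host,
# so an error for either name means that VM lives on a different node.
Get-VM -Name $vmNames | Select-Object Name, ComputerName, State

# Both VMs should report the same virtual switch in SwitchName.
Get-VMNetworkAdapter -VMName $vmNames | Select-Object VMName, SwitchName
```

If either VM is missing or the SwitchName values differ, sort that out first, as the later steps assume both are side by side.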

If Nested Virtualisation isn’t enabled on your host, enable it with bcdedit /set {current} hypervisorloadoptions OFFERNESTEDVIRT and reboot.

Please, please, please, make sure that your Repair Garage VM has all available updates installed. If it doesn’t, there is a very high chance that it will all go tits up later on.

Next we enable Nested Virtualisation on the Repair Garage VM using the script at https://github.com/Microsoft/Virtualization-Documentation/blob/master/hyperv-tools/Nested/Enable-NestedVm.ps1
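For reference, the key settings a guest needs in order to act as a nested Hyper-V host can also be applied by hand. This is a sketch based on my understanding of the nested virtualisation requirements, not a copy of what Enable-NestedVm.ps1 does internally; the VM name is an assumption, and the MAC spoofing step only matters if the nested guests need network access of their own.

```powershell
$vmName = 'Repair Garage'   # assumed name; replace with your own

# The VM has to be powered off before its processor settings can be changed.
Stop-VM -Name $vmName

# Expose the hardware virtualisation extensions to the guest.
Set-VMProcessor -VMName $vmName -ExposeVirtualizationExtensions $true

# Dynamic memory and nested virtualisation do not mix, so use static memory.
Set-VMMemory -VMName $vmName -DynamicMemoryEnabled $false

# Optional: lets nested guests send traffic through the garage's vNIC.
Set-VMNetworkAdapter -VMName $vmName -MacAddressSpoofing On

Start-VM -Name $vmName
```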

We can check whether all is set up correctly using the following script on the host:

https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/hyperv-tools/Nested/Get-NestedVirtStatus.ps1

Our Repair Garage is indeed ready to be a nested virtualisation host, so onwards we go!

On the host, run the script StartShieldedVMRecoveryOnFabric.ps1 as an Administrator.

The process kicks off, and you hold your breath…

… and it fails. Every time for me. Until I realised that the script is dependent on your Stricken VM’s disk being Dynamic, not Fixed, so a quick convert to Dynamic later and we’re up and running again…
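Checking the disk type up front saves a failed run. The following is a rough sketch using Get-VHD and Convert-VHD from the Hyper-V module; both paths are hypothetical placeholders, and the VM needs to be shut down so the disk is not in use during the conversion.

```powershell
# Hypothetical paths; point these at the Stricken VM's actual OS disk.
$source = 'D:\VMs\StrickenVM\os.vhdx'
$target = 'D:\VMs\StrickenVM\os-dynamic.vhdx'

# VhdType reports Fixed, Dynamic or Differencing.
Get-VHD -Path $source | Select-Object Path, VhdType, Size, FileSize

# Write a Dynamic copy of the disk; the source file is left untouched.
Convert-VHD -Path $source -DestinationPath $target -VHDType Dynamic
```

Convert-VHD writes a new file rather than converting in place, so the VM then has to be pointed at the converted disk before the recovery script is run again.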

Note that at line 78, the script attaches an exported version of the Stricken VM’s OS drive to SCSI Controller 0, Location 1 of the Repair Garage. If you have an ISO or data disk attached to your Repair Garage, this will cause it to fail as the slot will be occupied.
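A quick pre-flight check for that slot might look like the sketch below. The VM name is an assumption; the cmdlets are from the standard Hyper-V module. If either query returns anything, the slot is taken and whatever is there needs moving or detaching first.

```powershell
$garage = 'Repair Garage'   # assumed name; replace with your own

# Any hard disk already sitting on SCSI controller 0, location 1?
Get-VMHardDiskDrive -VMName $garage |
    Where-Object {
        $_.ControllerType -eq 'SCSI' -and
        $_.ControllerNumber -eq 0 -and
        $_.ControllerLocation -eq 1
    }

# Shielded VMs are generation 2, where DVD drives hang off the SCSI controller too.
Get-VMDvdDrive -VMName $garage |
    Where-Object { $_.ControllerNumber -eq 0 -and $_.ControllerLocation -eq 1 }
```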

If all goes well, you should get this output:

… and you can hopefully see the recovery VHDX attached to the Repair Garage VM.

Taking on the role of the tenant now, I RDP into the Repair Garage VM and check that the recovery disk is attached and offline.
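From a PowerShell prompt inside the Repair Garage, that check can be done with Get-Disk from the built-in Storage module. A minimal example:

```powershell
# The recovery disk should appear as an additional disk with IsOffline set to True.
Get-Disk | Select-Object Number, FriendlyName, Size, OperationalStatus, IsOffline
```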

The next step is the PrepareShieldedVMTroubleshooting.ps1 script from the documentation, which does a whole lot of work that should result in the Stricken VM starting as a VM nested within the Shielded Repair Garage. In theory. The script claims to install Hyper-V on the Repair Garage VM, but it doesn’t, so install that manually first and reboot, then wipe your brow when you can RDP back in successfully.
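Installing the role by hand is a one-liner from an elevated PowerShell prompt inside the Repair Garage VM. This is the standard Server Manager cmdlet rather than anything taken from the recovery scripts:

```powershell
# Installs the Hyper-V role plus management tools, then reboots the VM.
Install-WindowsFeature -Name Hyper-V -IncludeManagementTools -Restart

# After the reboot, confirm the role shows as Installed.
Get-WindowsFeature -Name Hyper-V
```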

With Hyper-V in place, we run the PrepareShieldedVMTroubleshooting.ps1 script, grit our teeth, pray to the old gods and the new, and again breathe a sigh of relief when it succeeds.

This brings the data disk online…

… imports the VM into Hyper-V in the Repair Garage…

… creates C:\Certs, and populates it with a temporary recovery guardian certificate and a key protector file.

These should be copied to the Hyper-V host on which the Repair Garage and Stricken VM reside, after which we run the ‘GrantShieldedVMRecoveryGuardian.ps1’ script, which should generate a new Key Protector, but unfortunately at this stage it fails.

I’ve spent some time troubleshooting this and haven’t been able to make any headway yet. It fails at the point in Grant-HgsKeyProtectorAccess (in the HGSClient module) where it passes the Key Protector and Guardian info to the MSFT_HgsKeyProtector class to grant access. From debugging, all fields are being correctly populated and passed, but unfortunately it fails with an Index out of range error every time.

So a few lessons learned so far, and I’m confident part 2 will see this resolved and then on we push as there are but a few steps left 🙂

Edit: I’ve had confirmation that this issue is a bug in TP5 which is fixed in RTM.
