VMware Snapshots – Some Thoughts and Best Practice


VMware gives you the capability to “snapshot” a powered-on or powered-off virtual machine, preserving its state at a particular point in time. Although this is a very useful feature, it is not intended as a long-term backup or archive, or as a pre-upgrade/change copy that is kept around indefinitely. A snapshot allows an administrator to create a “version” of a virtual machine, branch off from it to perform tests, and then cleanly roll back to a known state. This is very useful in a development environment, but it can also be helpful when performing a potentially risky upgrade (or change) to the virtual machine, as it allows you to safely roll back the state of the virtual machine (bearing in mind that any data changes made since the snapshot are lost).

How Snapshots Work

There is a large amount of documentation available online about how snapshots actually do their job, but a very quick summary is given below.

A virtual machine has virtual disk files that represent physical disks, and these are stored on a storage medium (e.g. a storage array). When the virtual machine reads and writes to its hard disk, it is actually reading and writing to a virtual disk file (a VMDK in the case of VMware) that is stored on this storage medium. Let’s say we have a virtual machine called SERVER1 with a single virtual disk called SERVER1.vmdk. In normal operation any reads and writes for this virtual disk are made to the SERVER1.vmdk file.

When a snapshot is taken, the disk SERVER1.vmdk is frozen (i.e. no further writes are made to it, although it is still read). Instead a delta disk is created called SERVER1-000001.vmdk (or similar) and any changes are written to this disk. The delta works a bit like a database transaction log: it records the blocks that have changed since the snapshot, rather than writing over the top of the same blocks in the original disk. What this means is that the delta file keeps growing as more blocks change, and it can grow until it is roughly the size of the base disk. So your SERVER1.vmdk disk might be 100GB, but if 20GB of new changes are written a day, the delta file will grow by 20GB a day, and after three days the SERVER1 virtual machine will take up around 160GB. With a chain of snapshots, each delta can grow in the same way.

There are two impacts to this: 1. you use more and more disk space (and you might run out) and 2. you take a performance hit on IO, because VMware needs to work through the original disk and any delta disks (you can have more than one if there are chained snapshots) to find the current version of a block. The more snapshots and/or the bigger the snapshot delta, the slower it gets! From this you can see that it’s bad to leave a snapshot running indefinitely. There is also a risk to the integrity of the virtual machine: if the delta file is corrupted, the whole VM can be lost.
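As a rough illustration of the read and write paths described above, here is a minimal Python sketch. It is a toy model only: the `DiskChain` class and its methods are hypothetical names invented for this example, blocks are modelled as dictionary entries, and none of it reflects the real VMDK on-disk format or any VMware API.

```python
# Toy model of a snapshot disk chain. Blocks are dict entries: block number -> data.
# Illustration only; real VMDK delta disks are far more involved.

class DiskChain:
    def __init__(self, base_blocks):
        self.base = dict(base_blocks)   # stands in for SERVER1.vmdk
        self.deltas = []                # stands in for SERVER1-00000N.vmdk files

    def take_snapshot(self):
        """Stop writing to the current top disk and start a new, empty delta."""
        self.deltas.append({})

    def write(self, block, data):
        """Writes always land on the newest disk in the chain."""
        target = self.deltas[-1] if self.deltas else self.base
        target[block] = data

    def read(self, block):
        """Search the newest delta first, then older deltas, then the base disk.

        Returns (data, number_of_disks_checked).
        """
        checked = 0
        for disk in reversed(self.deltas):
            checked += 1
            if block in disk:
                return disk[block], checked
        checked += 1
        return self.base.get(block), checked


chain = DiskChain({0: "boot", 1: "config v1"})
chain.take_snapshot()
chain.write(1, "config v2")   # goes to the first delta, not the base disk
chain.take_snapshot()         # second, chained snapshot

print(chain.base[1])   # config v1  (the frozen base disk is unchanged)
print(chain.read(1))   # ('config v2', 2)
print(chain.read(0))   # ('boot', 3)
```

In this model, a block that has never changed is only found after every delta has been checked, which is one way to picture why longer chains cost more per IO.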

If you want to roll forward the snapshot (i.e. you want to persist the changes made since the SERVER1 main/base disk was frozen) then the delta disk is written back into the main disk from start to finish (like a transaction log), replaying all the changes into the base disk as if they had really happened there. This is a very resource-intensive action and can take a long time (even days) for a very big snapshot. Once all the changes have been written in, SERVER1.vmdk is unfrozen and the delta file is deleted. Reads and writes continue as normal, with the persisted changes stored in the base/main disk.

If, however, you want to roll back the snapshot (i.e. you want to discard the changes made since the SERVER1 main/base disk was frozen) then reads and writes simply continue on the base disk once again. The changes made in the delta file are lost as it is deleted, and the server is immediately reverted to the state it was in at the time the base disk was frozen.
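The two outcomes can be sketched in a few lines of Python. This is a simplified model under the same assumption as the description above (a disk is just a mapping of block number to data); `commit` and `revert` are hypothetical helper names, not VMware operations.

```python
def commit(base, delta):
    """Roll forward: replay every changed block into the base disk, then empty the delta."""
    for block, data in delta.items():
        base[block] = data      # work grows with the number of changed blocks
    delta.clear()
    return base


def revert(base, delta):
    """Roll back: throw the delta away; the base disk is not touched at all."""
    delta.clear()
    return base


base = {0: "boot", 1: "config v1"}
delta = {1: "config v2", 2: "new file"}

# Work on copies so both outcomes can be shown from the same starting point.
kept = commit(dict(base), dict(delta))
discarded = revert(dict(base), dict(delta))

print(kept)        # {0: 'boot', 1: 'config v2', 2: 'new file'}
print(discarded)   # {0: 'boot', 1: 'config v1'}
```

Note how `commit` has to touch every changed block while `revert` does no copying, which mirrors why rolling forward a large snapshot is slow and rolling back is close to immediate.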

Removing a Chain of Virtual Machine Snapshots and Persisting the Changes

If you have a chain of snapshots on a virtual machine and you want to delete these and persist the changes, the process is the same as removing a snapshot and persisting the changes, except you need to be a bit more aware of performance issues when the snapshots are being removed.

An informal VMware best practice is to remove the snapshots one by one, starting with the oldest and working through to the newest. The reason for this is that only the last delta disk in the chain is receiving active IO. When you remove the first snapshot, IO is generated between the base/main disk and the first snapshot delta file, and the later delta file (which is being actively used for IO) is not affected. This means you can remove these snapshots while minimising the impact on the running virtual machine.
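The oldest-first order can be pictured with another small Python sketch. It uses the same toy model (disks as dictionaries of block number to data), and `remove_oldest_snapshot` is a hypothetical helper written for this example, not a VMware call. The point it shows is that the active delta is left alone until the very last step.

```python
def remove_oldest_snapshot(base, deltas):
    """Merge the oldest delta into the base disk and drop it from the chain.

    Returns the number of blocks copied, a rough stand-in for the IO generated.
    """
    oldest = deltas.pop(0)
    base.update(oldest)
    return len(oldest)


base = {0: "boot", 1: "config v1"}
deltas = [
    {1: "config v2"},               # oldest snapshot delta
    {2: "patch A", 3: "patch B"},   # middle snapshot delta
    {1: "config v3"},               # newest delta: the VM is actively writing here
]
active = deltas[-1]

copied = []
while len(deltas) > 1:
    copied.append(remove_oldest_snapshot(base, deltas))
    # The delta receiving live IO has not been touched by this step.
    assert deltas[-1] is active

# Only the final step involves the active delta.
copied.append(remove_oldest_snapshot(base, deltas))

print(copied)   # [1, 2, 1]
print(base)     # {0: 'boot', 1: 'config v3', 2: 'patch A', 3: 'patch B'}
```

Working one snapshot at a time also gives you a natural pause between steps to check how the virtual machine is coping before starting the next merge.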

Essentially, don’t use the “Delete All” button. You’ll lose all control over the process, and if the VM becomes unresponsive once it has started, it may be hours before it starts working again. And no, you can’t interrupt it, or the VM could become corrupted.
