POSSIBLE DATA CORRUPTION ISSUE, MUST READ! Sorry for starting this way, but this is a big deal, and I wanted to make sure you catch this when scanning through the weekend spam. Apparently, VMware issued a support KB article on this issue a few weeks ago, but it totally flew under the radar (I have not seen a single tweet or blog post about it). In short, any data flowing through the VM network stack may get corrupted, including file copies, remote client interactions with databases, and any client-server or multi-tiered apps.

The scariest part is that the scope of this issue is very significant. In fact, we may well be facing the biggest data corruption issue in the history of virtualization. The issue may occur on any Windows Server 2012 VM with the default (E1000E) vNIC adapter running on ESXi 5.0 and 5.1, which makes it probably around 20% of all VMs in the world. The easiest workaround is to change the vNIC type to VMXNET3 or E1000 (you should be able to apply this change in bulk with a PowerCLI script), or to disable TCP Segmentation Offload in the guest operating system. Keep in mind that changing the vNIC type may result in a different DHCP address, because the OS will see it as a new network adapter, so this may affect some applications. As such, disabling TCP Segmentation Offload may sometimes be a better choice; however, this increases VM CPU usage.
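Before applying either workaround, it helps to know which VMs match the conditions described above. The following is only a rough sketch in Python, assuming you have already exported your VM inventory into simple records; the `VmRecord` type, its field names, and the string formats are hypothetical and not taken from any VMware tool.

```python
from dataclasses import dataclass


@dataclass
class VmRecord:
    """One row of a hypothetical inventory export (field names are assumptions)."""
    name: str
    guest_os: str       # e.g. "Microsoft Windows Server 2012 (64-bit)"
    nic_type: str       # e.g. "E1000E", "Vmxnet3", "E1000"
    esxi_version: str   # host version string, e.g. "5.1.0"


# Host versions named in the text above.
AFFECTED_ESXI_PREFIXES = ("5.0", "5.1")


def is_potentially_affected(vm: VmRecord) -> bool:
    """True if the VM matches all three conditions: guest OS, vNIC type, host version.

    Note: the substring check is deliberately broad, so a guest reported as
    "Windows Server 2012 R2" would also be flagged for review.
    """
    return (
        "windows server 2012" in vm.guest_os.lower()
        and vm.nic_type.strip().lower() == "e1000e"
        and vm.esxi_version.startswith(AFFECTED_ESXI_PREFIXES)
    )


def find_affected(vms):
    """Return the names of VMs worth checking, in input order."""
    return [vm.name for vm in vms if is_potentially_affected(vm)]
```

The resulting list is only a starting point for review; the actual vNIC change or the TCP Segmentation Offload setting still has to be applied with your usual tooling.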

As for backups: even if some of your backup infrastructure components are running in a Windows Server 2012 VM, you should be safe if you are using Veeam Backup & Replication 6.5 or later. That was the version in which we added inline network traffic verification to work around some unrelated data corruption issues involving faulty network equipment that we had observed in support. I had a big story about this in a weekly digest over a year ago. Unfortunately, your actual production data may already be corrupted, and unless you still have backups going all the way back to the time of your vSphere 5.x or Windows Server 2012 upgrade, this might be one of those cases of unrecoverable data loss... and worst of all, without running a compare against a copy of the data that is known to be "good", it is impossible to say which specific parts of the data are corrupted...
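If you do have a known-good copy (for example, a restore from an old backup), the compare mentioned above can be approximated by hashing both sides file by file. This is a minimal sketch in Python, not a description of how any backup product does it; the function names and the choice of SHA-256 are my own assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(known_good, suspect):
    """Compare every file under known_good with the same relative path under suspect.

    Returns (mismatched, missing): relative paths whose contents differ, and
    relative paths that exist only in the known-good tree.
    """
    good_root = Path(known_good)
    suspect_root = Path(suspect)
    mismatched, missing = [], []
    for good_file in sorted(p for p in good_root.rglob("*") if p.is_file()):
        rel = good_file.relative_to(good_root)
        candidate = suspect_root / rel
        if not candidate.is_file():
            missing.append(rel.as_posix())
        elif sha256_of(good_file) != sha256_of(candidate):
            mismatched.append(rel.as_posix())
    return mismatched, missing
```

Keep in mind that a mismatch only says the two copies differ: files that were legitimately modified after the known-good copy was taken will show up as well, so the output is a list of candidates to inspect rather than a list of confirmed corruption.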

According to the VMware support KB, the investigation is still ongoing, so I would not yet jump to the conclusion that this is a VMware bug. For example, we did see one mysterious data corruption issue during weeks of automated stress testing of our Windows Server 2012 support. We call it the "10 bad bits mystery" internally, and it affected network transfers on both physical and virtual hardware. Unfortunately, the issue was impossible to reproduce reliably, so our investigation with Microsoft went nowhere (and we already had the problem covered with our network traffic verification anyway). But if anyone from VMware R&D or support is reading this, feel free to reach out to me to discuss the data corruption pattern, as well as the factors that make the issue surface, as this could be the same issue.

--> Please see the information in Gostev's weekly newsletter, a very interesting newsletter!