PAPITopics:Virtual PAPI Performance

Testing Environment

Bare Metal

  • 16-core Intel Xeon CPU @ 2.9 GHz
  • 64 GB memory
  • Ubuntu Server 12.04
  • Linux kernel 3.6

KVM

  • QEMU version 1.2.0
  • Guest VM - Ubuntu Server 12.04
  • Guest VM - Linux kernel 3.6
  • Guest VM - 16 GB RAM
  • More information on installing PAPI on KVM is available at PAPITopics:PAPI on KVM

VMware

  • VMware ESXi 5.1
  • Guest VM - Ubuntu Server 12.04
  • Guest VM - Linux kernel 3.6
  • Guest VM - 16 GB RAM
  • More information on installing PAPI on VMware is available at PAPITopics:PAPI on VMware

Test Suite

The tests performed are taken from the Mantevo suite. The purpose of the Mantevo suite is to provide mini-applications that mimic the performance characteristics of real-world, large-scale applications. The applications listed below were used for testing.

CloverLeaf

CloverLeaf is a mini-app that solves the compressible Euler equations on a Cartesian grid, using an explicit, second-order accurate method. Each cell stores three values: energy, density, and pressure. A velocity vector is stored at each cell corner. This arrangement of data, with some quantities at cell centers and others at cell corners, is known as a staggered grid. CloverLeaf currently solves the equations in two dimensions. The computation in CloverLeaf has been broken down into "kernels": low-level building blocks with minimal complexity. Each kernel loops over the entire grid and updates one (or more) mesh variables, based on a kernel-dependent computational stencil. Control logic within each kernel is kept to a minimum, allowing maximum optimisation by the compiler. Memory is sacrificed in order to increase performance, and any updates to variables that would introduce dependencies between loop iterations are written into copies of the mesh.
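As a concrete illustration of writing updates into a copy of the mesh, the minimal C sketch below updates a grid variable with a simple 5-point stencil, reading from one copy and writing into a second copy so that no loop iteration depends on the result of another. The grid size, stencil, and variable names are illustrative only and are not taken from CloverLeaf's source.

  #include <stdio.h>
  #include <stdlib.h>

  #define NX 16
  #define NY 16
  #define IDX(i, j) ((i) * NY + (j))

  static void stencil_update(const double *src, double *dst)
  {
      for (int i = 1; i < NX - 1; i++) {
          for (int j = 1; j < NY - 1; j++) {
              /* All reads come from src and all writes go to dst, so no
               * loop iteration depends on the result of another and the
               * compiler is free to vectorize or reorder the loop. */
              dst[IDX(i, j)] = 0.25 * (src[IDX(i - 1, j)] + src[IDX(i + 1, j)] +
                                       src[IDX(i, j - 1)] + src[IDX(i, j + 1)]);
          }
      }
  }

  int main(void)
  {
      double *a = calloc(NX * NY, sizeof *a);
      double *b = calloc(NX * NY, sizeof *b);
      if (a == NULL || b == NULL)
          return EXIT_FAILURE;

      a[IDX(NX / 2, NY / 2)] = 1.0;   /* single nonzero cell to diffuse */

      /* Ping-pong between the two copies of the mesh on each sweep. */
      for (int step = 0; step < 10; step++) {
          stencil_update(a, b);
          double *tmp = a; a = b; b = tmp;
      }

      printf("center value after 10 sweeps: %g\n", a[IDX(NX / 2, NY / 2)]);
      free(a);
      free(b);
      return EXIT_SUCCESS;
  }

Swapping the two copies after each sweep keeps the extra cost to one additional grid, which is the memory-for-performance trade-off described above.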

CoMD

CoMD is a reference implementation of classical molecular dynamics algorithms and workloads as used in materials science. It was created and is maintained by the Exascale Co-Design Center for Materials in Extreme Environments (ExMatEx).

HPCCG

HPCCG is a simple conjugate gradient benchmark code for a 3D chimney domain on an arbitrary number of processors, written by Michael A. Heroux of Sandia National Laboratories (maherou@sandia.gov). It is a self-contained piece of C++ software that generates a 27-point finite difference matrix with a user-prescribed sub-block size on each processor. It is implemented to be very scalable in a weak sense: any reasonable parallel computer should be able to achieve excellent scaled (weak-scaling) speedup. Kernel performance should be reasonable, but no attempt has been made to provide special kernel optimizations.

MiniGhost

miniGhost is a finite difference mini-application which implements a difference stencil across a homogeneous three-dimensional domain.

The kernels it contains are:

  • computation of the stencil operations,
  • inter-process boundary (halo, ghost) exchange, and
  • global summation of grid values.

The computation simulates heat diffusion across a homogeneous domain with Dirichlet boundary conditions. It does not currently solve a specific problem that can be checked for correctness. However, it can be run in a mode that does check correctness in a limited sense: the domain is initialized to 0, a heat source is applied to a single cell in the middle of the global domain, and the grid values are summed and compared with the initial source term.

Testing Procedure

For each platform (bare metal, KVM, VMware), each application was run while measuring one PAPI preset event per run. The source code of each application was modified to place PAPI measurements around its main computational work. Ten runs were performed for each event. Tests were run on a "quiet" system: no other users were logged on and only minimal OS services were running at the same time. Tests were performed in succession with no reboots in between.
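The sketch below shows, using PAPI's low-level C API, the general shape of this kind of instrumentation: a single preset event (PAPI_L1_ICM is used here purely as an example) is counted across a region standing in for an application's main computational work. The work loop is a placeholder; in the actual tests the measurements wrap each mini-app's own computation.

  #include <stdio.h>
  #include <stdlib.h>
  #include <papi.h>

  int main(void)
  {
      int eventset = PAPI_NULL;
      long long count = 0;

      /* Initialize the PAPI library. */
      if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
          fprintf(stderr, "PAPI_library_init failed\n");
          return EXIT_FAILURE;
      }

      /* Create an event set containing the single preset event to measure. */
      if (PAPI_create_eventset(&eventset) != PAPI_OK ||
          PAPI_add_event(eventset, PAPI_L1_ICM) != PAPI_OK) {
          fprintf(stderr, "could not add PAPI_L1_ICM to the event set\n");
          return EXIT_FAILURE;
      }

      /* Start counting just before the main computational work ... */
      PAPI_start(eventset);

      /* Placeholder for the application's main computational work. */
      volatile double x = 0.0;
      for (long i = 0; i < 10000000L; i++)
          x += (double)i * 0.5;

      /* ... and stop counting immediately after it. */
      PAPI_stop(eventset, &count);

      printf("PAPI_L1_ICM: %lld\n", count);
      return EXIT_SUCCESS;
  }

Built against PAPI (for example, cc instrumented.c -lpapi), each execution reports one count for the chosen event; repeating this ten times per event on each platform yields the distributions summarized below.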

Results

Reading Results

Results are presented as box plots. Along the x-axis is a PAPI preset event. Along the y-axis is the ratio of event counts on the virtual machine to event counts on bare metal: the mean of 10 runs on the virtual machine is divided by the mean of 10 runs on the bare metal system. A ratio of 1 therefore corresponds to an identical number of event counts, and a ratio of 2 corresponds to twice as many counts on the virtual machine as on bare metal. In nearly all cases we would expect the ratio to be greater than 1, due to the overhead of virtualization, with a few exceptions. As can be seen in the graphs, most boxes appear as a straight horizontal line. This happens because the standard deviation for these events is so low compared to others that, at the scale of the y-axis, the whiskers appear to overlap the box.
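Written out explicitly, the plotted value for a preset event e is the ratio of the two sample means. In the formula below, c^{vm}_{e,i} and c^{bm}_{e,i} denote the counts of event e observed on the i-th of the 10 runs on the virtual machine and on bare metal respectively; this notation is introduced here only for illustration.

  \mathrm{ratio}_e = \frac{\frac{1}{10}\sum_{i=1}^{10} c^{vm}_{e,i}}{\frac{1}{10}\sum_{i=1}^{10} c^{bm}_{e,i}}

Since the factor of 1/10 cancels, this is simply the total count over the ten virtual machine runs divided by the total count over the ten bare metal runs.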

VMware

Cloverleaf

Virtperf.vmware.CLOVERLEAF.png

CoMD

Virtperf.vmware.COMD.png

HPCCG

Virtperf.vmware.HPCCG.png

MiniGhost

Virtperf.vmware.MINIGHOST.png

MiniXyce

Virtperf.vmware.MINIXYCE.png

KVM

Cloverleaf

Virtperf.kvm.CLOVERLEAF.png

CoMD

Virtperf.kvm.COMD.png

HPCCG

Virtperf.kvm.HPCCG.png

MiniGhost

Virtperf.kvm.MINIGHOST.png

MiniXyce

Virtperf.kvm.MINIXYCE.png

Observations

On inspection of the results, there are two main classes of events which exhibit ratios significantly different from 1: instruction cache events, such as PAPI_L1_ICM, and translation lookaside buffer (TLB) events, such as PAPI_TLB_IM. These two classes are examined more closely below. Another anomaly which has yet to be explained is that KVM reports non-zero counts for PAPI_VEC_DP and PAPI_VEC_SP (both related to vector operations), whereas bare metal always reports 0.

Results for MiniXyce appear much less consistent than those of the other tests in the suite. However, these results also show instruction cache and TLB events as the ones that differ most between the virtual machine and bare metal.

Instruction Cache

From the results, we can see that instruction cache events are much more frequent on the virtual machines than on bare metal. These events include PAPI_L1_ICM, PAPI_L2_ICM, PAPI_L2_ICA, PAPI_L3_ICA, PAPI_L2_ICR, and PAPI_L3_ICR. A simple deduction shows that the L2 events are directly related to PAPI_L1_ICM, and likewise the L3 events to PAPI_L2_ICM: the level 2 cache is only accessed in the event of a level 1 cache miss. The miss rate compounds the total accesses at each level, and as a result the L3 instruction cache events differ the most from bare metal, with a ratio of over 2. It is therefore most pertinent to examine the PAPI_L1_ICM results, since all other instruction cache events are directly related. Below we can see the results of the HPCCG tests on bare metal, KVM, and VMware side by side. As can be seen on the graph, both KVM and VMware perform worse than bare metal. However, VMware is quite a bit better off, with only about 20% more misses than bare metal, whereas KVM exhibits nearly 40% more misses. Both have a larger standard deviation than bare metal, but not by a huge margin.
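The compounding can be made explicit. Because the level 2 instruction cache is accessed only on a level 1 instruction miss, and the level 3 cache only on a level 2 instruction miss, the event counts chain together. In the approximate relations below, m_2 denotes the per-access L2 instruction miss rate and the superscripts distinguish the virtual machine from bare metal; these symbols are introduced here only for illustration.

  \mathrm{L2\_ICA} \approx \mathrm{L1\_ICM}, \qquad \mathrm{L3\_ICA} \approx \mathrm{L2\_ICM} \approx m_2 \cdot \mathrm{L2\_ICA}

  \frac{\mathrm{L3\_ICA}^{vm}}{\mathrm{L3\_ICA}^{bm}} \approx \frac{\mathrm{L1\_ICM}^{vm}}{\mathrm{L1\_ICM}^{bm}} \cdot \frac{m_2^{vm}}{m_2^{bm}}

So, for example, an L1 instruction miss ratio of roughly 1.2 to 1.4, combined with a moderately higher per-access L2 miss rate on the virtual machine, is enough to push the L3 ratio above 2.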

L1icm.png

TLB

In the graph below (using data from the HPCCG application), we can see that data TLB misses are a major issue for both KVM and VMware. Both exhibit around 33 times more misses than runs on bare metal. There is little difference between the two virtualization platforms for this event.

Dtlb.png

Instruction TLB misses are also significantly more frequent on both virtualization platforms than on bare metal. However, VMware fares much better in this regard: not only does VMware incur about half the misses seen on KVM, it also has a much smaller standard deviation (even smaller than that of bare metal), in contrast to KVM's highly variable results.

Itlb.png