I have a problem on the Xeon Phi with including the papi.h header file. I cross-compiled the PAPI library on the host as described for the new PAPI version 5.3 and tested ./papi_avail on the MIC.
- Code: Select all
PAPI_L1_DCM   0x80000000  Yes  No  Level 1 data cache misses
PAPI_L1_ICM   0x80000001  Yes  No  Level 1 instruction cache misses
PAPI_TLB_DM   0x80000014  Yes  No  Data translation lookaside buffer misses
PAPI_TLB_IM   0x80000015  Yes  No  Instruction translation lookaside buffer misses
PAPI_L2_LDM   0x80000019  Yes  No  Level 2 load misses
PAPI_BR_MSP   0x8000002e  Yes  No  Conditional branch instructions mispredicted
PAPI_TOT_INS  0x80000032  Yes  No  Instructions completed
PAPI_LD_INS   0x80000035  Yes  No  Load instructions
PAPI_SR_INS   0x80000036  Yes  No  Store instructions
PAPI_BR_INS   0x80000037  Yes  No  Branch instructions
PAPI_VEC_INS  0x80000038  Yes  No  Vector/SIMD instructions (could include integer)
PAPI_TOT_CYC  0x8000003b  Yes  No  Total cycles
PAPI_L1_DCA   0x80000040  Yes  No  Level 1 data cache accesses
PAPI_L1_ICA   0x8000004c  Yes  No  Level 1 instruction cache accesses
So it works. Then I tried to compile my offload code with papi.h included as follows:
- Code: Select all
#pragma offload_attribute (push,target(mic))
#include "papi.h"
#pragma offload_attribute (pop)
I compiled the code with:
- Code: Select all
icpc -openmp -O3 -offload-option,mic,ld,/home/$(ACCOUNT_NAME)/lib/papi/lib/libpapi.a foo.cpp -o foo.exe
and got the following error:
- Code: Select all
foo.cpp(11): catastrophic error: cannot open source file "papi.h"
#include "papi.h"
^
compilation aborted for foo.cpp (code 4)
So my question is: how can I include the header file? The approach described for PAPI 5.3 doesn't work for me.
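One guess I could try: since the compiler on the host side can't find papi.h at all, maybe the include path of the cross-compiled PAPI just has to be passed explicitly. The include prefix below is an assumption based on the library path in my compile line (the headers may live somewhere else on another system):

```shell
# hypothetical: add -I pointing at the cross-compiled PAPI headers,
# assumed (not verified) to be next to lib/ under the same prefix
icpc -openmp -O3 \
     -I/home/$(ACCOUNT_NAME)/lib/papi/include \
     -offload-option,mic,ld,/home/$(ACCOUNT_NAME)/lib/papi/lib/libpapi.a \
     foo.cpp -o foo.exe
```

But I'm not sure whether this is the intended way for offload code, or whether the host compile step needs a separate (host-built) papi.h.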
Some information about the MIC I use:
- Code: Select all
--------------------------------------------------------------------------------
PAPI Version : 5.3.0.0
Vendor string and code : GenuineIntel (1)
Model string and code : 0b/01 (1)
CPU Revision : 3.000000
CPUID Info : Family: 11 Model: 1 Stepping: 3
CPU Max Megahertz : 1052
CPU Min Megahertz : 842
Hdw Threads per core : 4
Cores per Socket : 60
Sockets : 1
CPUs per Node : 240
Total CPUs : 240
Running in a VM : no
Number Hardware Counters : 2
Max Multiplex Counters : 64
--------------------------------------------------------------------------------
Hope you can help me.

Here is my full code:
- Code: Select all
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <cstddef>
#include <omp.h>
#include <immintrin.h>
#include <zmmintrin.h>

#pragma offload_attribute (push,target(mic))
#include "papi.h"
#pragma offload_attribute (pop)

void init( float* x, float* y, float* z, std::size_t n );

int main(int argc, char** argv) {
    float sum = 0.0f;
    const float a = 1.5f;
    std::size_t n = 4096 * 125;

    __attribute__((target(mic))) float *x;
    __attribute__((target(mic))) float *y;
    __attribute__((target(mic))) float *z;

    posix_memalign( (void**) &x, 4096, n * sizeof(float) );
    posix_memalign( (void**) &y, 4096, n * sizeof(float) );
    posix_memalign( (void**) &z, 4096, n * sizeof(float) );

    init( x, y, z, n );

    #pragma offload target(mic) \
        in( n, a ) \
        in( x:length(n) align(4096) alloc_if(1) free_if(1) ) \
        in( y:length(n) align(4096) alloc_if(1) free_if(1) ) \
        inout( z:length(n) align(4096) alloc_if(1) free_if(1) )
    {
        omp_set_num_threads(240);
        #pragma omp parallel
        {
            __m512 x_, y_, a_;
            a_ = _mm512_set1_ps( a );
            #pragma omp for schedule(static)
            for ( std::size_t j = 0; j < n; j += 16 ){
                x_ = _mm512_load_ps( x + j );
                y_ = _mm512_load_ps( y + j );
                y_ = _mm512_fmadd_ps( a_, x_, y_ );   // z = a*x + y
                _mm512_store_ps( z + j, y_ );
            }
        }
    }

    // prevent the compiler from optimizing away the computed results
    for ( std::size_t j = 0; j < n; j++ ) sum += z[j];
    std::cout << std::endl << sum << std::endl;

    free(x);
    free(y);
    free(z);
    return 0;
}

void init( float* x, float* y, float* z, std::size_t n ) {
    for ( std::size_t i = 0; i < n; i++ ){
        x[i] = 1.2f * (float)i;
        y[i] = 4.3f * (float)i;
        z[i] = 0.0f;
    }
}