I am attempting to add GPU hardware performance counter (HWPC) data collection to the Open|SpeedShop CUDA experiment. Since we've used PAPI in some of our other data collectors and found it to work quite well, I was hoping to use the PAPI CUDA component for this purpose. I've been able to build PAPI 5.3.2 with the component enabled, and the provided unit test works fine:
[r219i0n0] ~/papi-5.3.2/src/components/cuda/tests: ./HelloWorld
WDH: ENTER CUDA_init_component() for pid=57607
WDH: EXIT CUDA_init_component() for pid=57607
PAPI_VERSION : 5 3 2
Name cuda:::Tesla_M2090:domain_d:inst_executed --- Code: 0x4000002b
END: Hello World!
14 --> cuda:::Tesla_M2090:domain_d:inst_executed
[r219i0n0] ~/papi-5.3.2/src/components/cuda/tests:
Note that I've added my own debug instrumentation (the "WDH:" lines) to CUDA_init_component() because I've determined that this is where I am running into issues. If I try to use this same build of PAPI with either papiex or the Open|SS cuda experiment, the victim application hangs:
[r219i0n0] ~: papiex -e PAPI_TOT_CYC /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
WDH: ENTER CUDA_init_component() for pid=57426
WDH: EXIT CUDA_init_component() for pid=57426
WDH: ENTER CUDA_init_component() for pid=57428
... The Process Hangs ...
[r219i0n0] ~: osscuda "/u/whachfel/samples/bin/x86_64/linux/release/BlackScholes"
[openss]: cuda experiment using the default hardware event sampling configuration: "interval=10000000,TOT_CYC".
[openss]: cuda experiment calling osscollect.
Creating topology file for pbs frontend node r219i0n0
Generated topology file: ./cbtfAutoTopology
Running cuda collector.
Program: /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
Creating openss database: ./BlackScholes-cuda-1.openss
Number of mrnet backends: 1
Topology file used: ./cbtfAutoTopology
executing sequential program: cbtfrun -m -c cuda /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
WDH: ENTER CUDA_init_component() for pid=57488
... The Process Hangs ...
If I switch both of these tools to a build of PAPI that doesn't include the CUDA component, the tests above run successfully and produce performance data.
Any ideas what the issue might be? It seems likely that the pre-main() initialization used by papiex and Open|SS, in combination with CUDA_init_component(), is causing the CUDA library to hang, but I haven't the faintest idea why. Any advice you could give would be greatly appreciated!
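For reference, the pre-main() pattern in question can be sketched roughly as below. This is only a minimal illustration, not the actual papiex or Open|SS code: it assumes the collector is initialized from a shared-library constructor (a GCC/Clang extension), and the names tool_init_stub(), premain_hook() and premain_init_ran() are hypothetical. The stub only logs in the same ENTER/EXIT style as my instrumentation; a real collector would initialize PAPI (and therefore the CUDA component) at the marked spot.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Set once the pre-main initialization has completed in this process. */
static int premain_init_done = 0;

/* Hypothetical stand-in for the tool's real initialization routine. */
static void tool_init_stub(void)
{
    /* Guard against initializing twice within the same process. */
    assert(!premain_init_done);

    fprintf(stderr, "ENTER tool_init_stub() for pid=%d\n", (int)getpid());
    /* Assumption: a real collector would initialize PAPI here, which in
       turn would run the CUDA component's initialization. */
    premain_init_done = 1;
    fprintf(stderr, "EXIT tool_init_stub() for pid=%d\n", (int)getpid());
}

/* The constructor attribute makes this run before main() is entered. */
__attribute__((constructor))
static void premain_hook(void)
{
    tool_init_stub();
}

/* Lets later code check whether the pre-main hook has already run. */
int premain_init_ran(void)
{
    return premain_init_done;
}
```

Any program that links this in (or has it injected as a shared library) will print the ENTER/EXIT pair before its own main() starts, once per process that loads it.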
-- Bill