Use of CUDA Component

Postby openss-wdh » Thu Nov 13, 2014 2:27 am

Greetings,

I'm working on adding GPU HWPC information to the CUDA collector in Open|SS (http://www.openspeedshop.org/wp/), and I'd like to base this capability on the PAPI CUDA component, since we have used PAPI elsewhere in Open|SS and found it to work quite well. I have PAPI 5.3.2 built with the CUDA component enabled, and it works correctly when run from the component's unit test:

Code:
[r219i0n0] ~/papi-5.3.2/src/components/cuda/tests: ./HelloWorld
WDH: ENTER CUDA_init_component() for pid=57607
WDH: EXIT CUDA_init_component() for pid=57607
PAPI_VERSION     :    5      3       2
Name cuda:::Tesla_M2090:domain_d:inst_executed --- Code: 0x4000002b
END: Hello World!
          14        --> cuda:::Tesla_M2090:domain_d:inst_executed
[r219i0n0] ~/papi-5.3.2/src/components/cuda/tests:
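
For reference, the relevant part of that test is just the standard PAPI event-set sequence. A minimal sketch of the same flow (the event name is taken from the output above, and error handling is mostly elided, so treat it as illustrative rather than a drop-in test):

Code:
#include <stdio.h>
#include <papi.h>

int main( void )
{
    long long count = 0;
    int evset = PAPI_NULL;

    /* Library init also initializes every compiled-in component,
     * including cuda - this is where CUDA_init_component() runs. */
    if ( PAPI_library_init( PAPI_VER_CURRENT ) != PAPI_VER_CURRENT )
        return 1;

    PAPI_create_eventset( &evset );
    PAPI_add_named_event( evset, "cuda:::Tesla_M2090:domain_d:inst_executed" );

    PAPI_start( evset );
    /* ... launch CUDA kernel(s) here ... */
    PAPI_stop( evset, &count );

    printf( "%12lld --> cuda:::Tesla_M2090:domain_d:inst_executed\n", count );
    return 0;
}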


Note that I've added my own debug "instrumentation" to the CUDA component, printing entry to and exit from the CUDA_init_component() function (a sketch of those prints follows the next transcript). When I try to use PAPI from our Open|SS collector, I find it hangs the victim application:

Code:
[r219i0n0] ~: osscuda "/u/whachfel/samples/bin/x86_64/linux/release/BlackScholes"
[openss]: cuda experiment using the default hardware event sampling configuration: "interval=10000000,TOT_CYC".
[openss]: cuda experiment calling osscollect.
Creating topology file for pbs frontend node r219i0n0
Generated topology file: ./cbtfAutoTopology
Running cuda collector.
Program: /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
Creating openss database: ./BlackScholes-cuda-1.openss
Number of mrnet backends: 1
Topology file used: ./cbtfAutoTopology
executing sequential program: cbtfrun -m -c cuda /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
WDH: ENTER CUDA_init_component() for pid=57488
... The Process Hangs ...
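
For reference, those WDH lines are just entry/exit prints I added inside CUDA_init_component() in components/cuda/linux-cuda.c, roughly like this (a sketch; the component's real initialization body is unchanged):

Code:
/* Sketch of the debug prints added to CUDA_init_component() in
 * components/cuda/linux-cuda.c. */
#include <stdio.h>
#include <unistd.h>

int
CUDA_init_component( int cidx )
{
    fprintf( stderr, "WDH: ENTER CUDA_init_component() for pid=%d\n",
             (int) getpid() );

    /* ... the component's existing setup: loading the CUDA/CUPTI
     * libraries, enumerating devices and their events, and (per the
     * stack trace further down) creating a CUDA context ... */

    fprintf( stderr, "WDH: EXIT CUDA_init_component() for pid=%d\n",
             (int) getpid() );
    return 0; /* PAPI_OK */
}

Note that in the hung runs the ENTER print appears without a matching EXIT, so whatever is blocking is inside CUDA_init_component() itself.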


I've also downloaded and built papiex against the same PAPI 5.3.2 build used above. It fails in a nearly identical way:

Code:
[r219i0n0] ~: papiex -e PAPI_TOT_CYC /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
WDH: ENTER CUDA_init_component() for pid=57426
WDH: EXIT CUDA_init_component() for pid=57426
WDH: ENTER CUDA_init_component() for pid=57428
... The Process Hangs ...


In both cases I'm not actually using a CUDA counter at all - just PAPI_TOT_CYC. Merely including the CUDA component in the PAPI build is enough to hang the victim application. If I switch to a separate PAPI 5.3.2 build that does NOT include the CUDA component, both of the above Open|SS and papiex tests work just fine.
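
For what it's worth, I believe the same thing could be checked without maintaining two PAPI builds by disabling the component at runtime before library init. A sketch, assuming PAPI_get_component_index() and PAPI_disable_component() can be called pre-init in 5.3.2 as I believe they can:

Code:
#include <stdio.h>
#include <papi.h>

int main( void )
{
    /* Must happen BEFORE PAPI_library_init(): mark the cuda component
     * disabled so that CUDA_init_component() is never invoked. */
    int cidx = PAPI_get_component_index( "cuda" );
    if ( cidx >= 0 )
        PAPI_disable_component( cidx );

    if ( PAPI_library_init( PAPI_VER_CURRENT ) != PAPI_VER_CURRENT ) {
        fprintf( stderr, "PAPI_library_init failed\n" );
        return 1;
    }

    printf( "PAPI initialized with the cuda component disabled\n" );
    return 0;
}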

It seems likely that the way Open|SS and papiex use pre-main() hooks (via libmonitor) to initialize PAPI, and with it the CUDA component, somehow prevents CUDA from initializing properly. A sample stack trace of the hung process looks like this:

Code:
#0  0x00002aaaab3316ab in raise () from /lib64/libpthread.so.0
#1  0x00002aaaaaeee862 in monitor_signal_handler (sig=20, info=0x7fffffffc6f0,
    context=0x7fffffffc5c0) at signal.c:229
#2  <signal handler called>
#3  0x00002aaaab32fb37 in sem_timedwait () from /lib64/libpthread.so.0
#4  0x00002aaaad11e713 in ?? () from /usr/lib64/libcuda.so
#5  0x00002aaaaca9286f in ?? () from /usr/lib64/libcuda.so
#6  0x00002aaaaca92a39 in ?? () from /usr/lib64/libcuda.so
#7  0x00002aaaaca7d75e in ?? () from /usr/lib64/libcuda.so
#8  0x00002aaaaca7da22 in ?? () from /usr/lib64/libcuda.so
#9  0x00002aaaaca7e0b9 in ?? () from /usr/lib64/libcuda.so
#10 0x00002aaaac9df9ed in ?? () from /usr/lib64/libcuda.so
#11 0x00002aaaac9e02a2 in ?? () from /usr/lib64/libcuda.so
#12 0x00002aaaac9c9773 in ?? () from /usr/lib64/libcuda.so
#13 0x00002aaaac99b2c1 in cuCtxCreate_v2 () from /usr/lib64/libcuda.so
#14 0x00002aaaac27ecb6 in CUDA_init_component (cidx=<optimized out>)
    at components/cuda/linux-cuda.c:551
#15 0x00002aaaac269234 in _papi_hwi_init_global () at papi_internal.c:1705
#16 0x00002aaaac267f75 in PAPI_library_init (version=<optimized out>,
    version@entry=84082688) at papi.c:613
#17 0x00002aaaaacd5a58 in papiex_process_init_routine ()
    at /u/whachfel/papiex/src/papiex.c:1308
#18 monitor_init_process (argc=argc@entry=0x2aaaab0fb398 <monitor_argc>,
    argv=<optimized out>, data=data@entry=0x0)
    at /u/whachfel/papiex/src/papiex.c:2654
#19 0x00002aaaaaee8116 in monitor_begin_process_fcn (
    user_data=user_data@entry=0x0, is_fork=is_fork@entry=0) at main.c:285
#20 0x00002aaaaaee845a in monitor_main (argc=1, argv=0x7fffffffda08,
    envp=0x7fffffffda18) at main.c:505
#21 0x00002aaaabef8c36 in __libc_start_main () from /lib64/libc.so.6
#22 0x00002aaaaaee87be in __libc_start_main (main=<optimized out>, argc=1,
    argv=0x7fffffffda08, init=0x44e270 <__libc_csu_init>,
    fini=0x44e260 <__libc_csu_fini>, rtld_fini=0x2aaaaaab9670 <_dl_fini>,
    stack_end=0x7fffffffd9f8) at main.c:556
#23 0x0000000000402ec9 in _start () at ../sysdeps/x86_64/elf/start.S:113
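
To isolate whether pre-main() initialization by itself is the trigger, I could imagine a minimal standalone reproducer along these lines (a sketch with hypothetical file names; a plain ELF constructor is not a perfect stand-in for libmonitor, which wraps __libc_start_main, but it runs at a similar point):

Code:
/* premain_papi.c (hypothetical) - minimal pre-main() PAPI init.
 * Build: gcc -shared -fPIC premain_papi.c -o libpremain_papi.so -lpapi
 * Run:   LD_PRELOAD=./libpremain_papi.so ./BlackScholes */
#include <stdio.h>
#include <papi.h>

__attribute__(( constructor ))
static void premain_papi_init( void )
{
    /* Runs before main(), like the monitor_init_process() hook in
     * the stack trace above. If the CUDA component hangs here too,
     * pre-main() initialization alone is enough to trigger it. */
    fprintf( stderr, "pre-main: calling PAPI_library_init()\n" );
    if ( PAPI_library_init( PAPI_VER_CURRENT ) != PAPI_VER_CURRENT )
        fprintf( stderr, "pre-main: PAPI_library_init() failed\n" );
    else
        fprintf( stderr, "pre-main: PAPI initialized\n" );
}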


Yet the Open|SS CUDA collector, which uses CUPTI directly to trace kernel executions and the like, and which initializes CUPTI pre-main(), works fine as long as the PAPI CUDA component is not included in the PAPI build. Excluding the component, of course, limits us to CPU-side HWPC information.
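
For comparison, the collector's direct CUPTI setup amounts to registering activity-buffer callbacks and enabling kernel activity records from the same pre-main() hooks. A simplified sketch (hypothetical names, buffer handling reduced to the minimum):

Code:
#include <stdint.h>
#include <cupti.h>

/* One static activity buffer; CUPTI wants 8-byte alignment. */
static uint8_t pool[ 1024 * 1024 ] __attribute__(( aligned( 8 ) ));

static void CUPTIAPI buffer_requested( uint8_t **buf, size_t *size,
                                       size_t *max_records )
{
    *buf = pool;
    *size = sizeof( pool );
    *max_records = 0;  /* as many records as fit */
}

static void CUPTIAPI buffer_completed( CUcontext ctx, uint32_t stream_id,
                                       uint8_t *buf, size_t size,
                                       size_t valid_size )
{
    /* ... walk the records with cuptiActivityGetNextRecord() ... */
}

void collector_init( void )  /* called from our pre-main() hook */
{
    cuptiActivityRegisterCallbacks( buffer_requested, buffer_completed );
    cuptiActivityEnable( CUPTI_ACTIVITY_KIND_KERNEL );
}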

Any ideas what the issue might be? Thanks in advance for any help you can provide!

-- Bill