Use of CUDA Component

Post by openss-wdh » Thu Nov 13, 2014 2:37 am

Greetings Everyone,

I am attempting to add GPU hardware performance counter (HWPC) data collection to the Open|SpeedShop CUDA experiment. Since we've used PAPI in some of our other data collectors, and found it to work quite well, I was hoping to utilize the PAPI CUDA component for this purpose. I've been able to successfully build PAPI 5.3.2 with the component enabled, and the provided unit test works fine:

Code:
[r219i0n0] ~/papi-5.3.2/src/components/cuda/tests: ./HelloWorld
WDH: ENTER CUDA_init_component() for pid=57607
WDH: EXIT CUDA_init_component() for pid=57607
PAPI_VERSION     :    5      3       2
Name cuda:::Tesla_M2090:domain_d:inst_executed --- Code: 0x4000002b
END: Hello World!
          14        --> cuda:::Tesla_M2090:domain_d:inst_executed
[r219i0n0] ~/papi-5.3.2/src/components/cuda/tests:


Note that I've added my own debug instrumentation to CUDA_init_component() because I've determined that this is where I am running into issues. If I try to use this same build of PAPI with either the Open|SS cuda experiment or papiex, I find that it hangs the victim application:

Code:
[r219i0n0] ~: papiex -e PAPI_TOT_CYC /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
WDH: ENTER CUDA_init_component() for pid=57426
WDH: EXIT CUDA_init_component() for pid=57426
WDH: ENTER CUDA_init_component() for pid=57428
... The Process Hangs ...
[r219i0n0] ~: osscuda "/u/whachfel/samples/bin/x86_64/linux/release/BlackScholes"
[openss]: cuda experiment using the default hardware event sampling configuration: "interval=10000000,TOT_CYC".
[openss]: cuda experiment calling osscollect.
Creating topology file for pbs frontend node r219i0n0
Generated topology file: ./cbtfAutoTopology
Running cuda collector.
Program: /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
Creating openss database: ./BlackScholes-cuda-1.openss
Number of mrnet backends: 1
Topology file used: ./cbtfAutoTopology
executing sequential program: cbtfrun -m -c cuda /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
WDH: ENTER CUDA_init_component() for pid=57488
... The Process Hangs ...


If I switch both of these tools to a build of PAPI that doesn't include the CUDA component, the tests above run successfully and produce performance data.

Any ideas what the issue might be? It seems likely that the papiex and Open|SS use of pre-main() initialization in combination with CUDA_init_component() is causing the CUDA library to hang. But I haven't the faintest idea why. Any advice you could give would be greatly appreciated!

-- Bill
openss-wdh
 
Posts: 6
Joined: Thu Nov 13, 2014 2:03 am

Re: Use of CUDA Component

Post by openss-wdh » Thu Nov 13, 2014 11:20 am

Sorry for the double post. After the first post the forum web system threw an exception and reported a bunch of errors, so I didn't think the first post (or this one for that matter) had gone through.

-- Bill

Re: Use of CUDA Component

Post by yarkhan » Tue Nov 25, 2014 12:55 pm

Is this issue with the CUDA component still there?

I have looked at the CUDA component, and can see no obvious reason why the component would work from a PAPI test but not from an OpenSS or papiex test.

If the problem still exists, could you please try a quick test:
Build PAPI with the CUDA component and make sure that there are no problems with CUDA initialization.
Run a PAPI test that accesses a non-CUDA event and make sure that the output makes sense.
./papi_component_avail
./papi_command_line PAPI_TOT_CYC

Regards,
Asim YarKhan
UTK - ICL - PAPI
yarkhan
 
Posts: 12
Joined: Mon Aug 11, 2014 10:33 am

Re: Use of CUDA Component

Post by openss-wdh » Mon Dec 01, 2014 10:57 pm

Hi Asim,

Thanks for the reply to my query. Yes, the issue is still present.

I believe the difference is that when OpenSS or papiex attempt to use PAPI with the CUDA component, they are initializing PAPI - and thus the CUDA library itself - before main() has been entered. That is, both tools use libmonitor to hook into the process before main() starts. In papi-5.3.2/src/components/cuda/tests/HelloWorld.cu (which is the test I was referring to), the PAPI library isn't initialized until after main() has been entered, when, presumably, the CUDA library has already been initialized. The CUDA library doesn't seem to like this early initialization coming from OpenSS/papiex before main() is entered.

I've worked around this in OpenSS - at least for the moment - by having the collector wait to initialize PAPI until after CUPTI indicates that a CUDA context has been created. This has the downside of preventing CPU-side collection of performance data prior to the construction of the first CUDA context. But otherwise it seems to work OK.
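
A minimal sketch of that deferral pattern, assuming a GCC/Clang toolchain. It is not the actual OpenSS code: init_counters() is a hypothetical stand-in for PAPI initialization, and on_cuda_context_created() is a hypothetical hook that a real collector would invoke from its CUPTI context-created notification.

```c
#include <assert.h>
#include <stdatomic.h>

static int collector_loaded = 0;      /* set before main() runs */
static int counters_initialized = 0;  /* set on the first CUDA context */
static atomic_flag init_started = ATOMIC_FLAG_INIT;

/* Runs before main() (GCC/Clang constructor attribute). It deliberately
 * does NOT initialize the counter library; it only records that the
 * collector has been loaded into the process. */
__attribute__((constructor))
static void collector_preload(void)
{
    collector_loaded = 1;
}

/* Hypothetical stand-in for the expensive initialization that must not
 * run before the CUDA library is ready (PAPI initialization in the
 * thread above). */
static void init_counters(void)
{
    assert(collector_loaded);  /* the pre-main hook must already have run */
    counters_initialized = 1;
}

/* Hypothetical hook: call this whenever a CUDA context is reported as
 * created. Only the first call performs the initialization. Note that a
 * concurrent second caller does not wait for the first to finish; a real
 * collector might prefer pthread_once() for that guarantee. */
void on_cuda_context_created(void)
{
    if (!atomic_flag_test_and_set(&init_started)) {
        init_counters();
    }
}

/* Small accessors so callers can inspect the state. */
int collector_is_loaded(void) { return collector_loaded; }
int counters_ready(void) { return counters_initialized; }
```

Calling on_cuda_context_created() more than once initializes only once. The trade-off is the one described above: nothing is sampled on the CPU side until the first context exists.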

-- Bill

Re: Use of CUDA Component

Post by openss-wdh » Tue Dec 02, 2014 2:41 am

Hi Asim,

I appear to have my OpenSS collector working now by delaying PAPI initialization until the first CUDA context is created. Now I've run into a different, but related, issue. It appears that PAPI will only allow me to monitor counters from the CPU or from the GPU, but not both concurrently. When I try to monitor both, I get:

Code:
[r219i0n0] ~: osscuda "/u/whachfel/samples/bin/x86_64/linux/release/BlackScholes" interval=10000000,cuda:::Tesla_M2090:domain_d:inst_executed,ix86arch::INSTRUCTION_RETIRED

[openss]: cuda experiment using input hardware event sampling configuration specified on the "osscuda" command: "interval=10000000,cuda:::Tesla_M2090:domain_d:inst_executed,ix86arch::INSTRUCTION_RETIRED" overriding the default hardware event sampling configuration.
[openss]: cuda experiment calling osscollect.
Creating topology file for pbs frontend node r219i0n0
Generated topology file: ./cbtfAutoTopology
Running cuda collector.
Program: /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
Creating openss database: ./BlackScholes-cuda-0.openss
Number of mrnet backends: 1
Topology file used: ./cbtfAutoTopology
executing sequential program: cbtfrun -m -c cuda /u/whachfel/samples/bin/x86_64/linux/release/BlackScholes
[CBTF/CUDA] cbtf_collector_start()
[CBTF/CUDA] cbtf_collector_start(): thread_count.value = 0 --> 1
[CBTF/CUDA] parse_configuration("interval=10000000,cuda:::Tesla_M2090:domain_d:inst_executed,ix86arch::INSTRUCTION_RETIRED")
[CBTF/CUDA] parse_configuration(): sampling interval = 10000000 nS
[CBTF/CUDA] parse_configuration(): event name = "cuda:::Tesla_M2090:domain_d:inst_executed"
[CBTF/CUDA] parse_configuration(): event name = "ix86arch::INSTRUCTION_RETIRED"
[CBTF/CUDA] start_papi_data_collection()
[/u/whachfel/samples/bin/x86_64/linux/release/BlackScholes] - Starting...
GPU Device 0: "Tesla M2090" with compute capability 2.0

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
[CBTF/CUDA] cbtf_collector_start()
[CBTF/CUDA] cbtf_collector_start(): thread_count.value = 1 --> 2
[CBTF/CUDA] start_papi_data_collection()
[CBTF/CUDA] start_papi(): context_count.value = 0 --> 1
[CBTF/CUDA] start_papi_data_collection()
[CBTF/CUDA] start_papi_data_collection(): PAPI_add_event(tls->papi_event_set, event_code) = -1 (Invalid argument)
[CBTF/CUDA] cbtf_collector_stop()
[CBTF/CUDA] send_data(): sending CBTF_cuda_data message (2 msg, 0 pc)
cbtf_collector_send DATA for r219i0n0.p4.nas.nasa.gov:26523:46912701372160:-1:
time_range[18446744073709551615, 0) addr range [0xffffffffffffffff, 0]
[CBTF/CUDA] cbtf_collector_stop(): thread_count.value = 2 --> 1
[CBTF/CUDA] cbtf_collector_stop()
[CBTF/CUDA] stop_papi_data_collection()
^C
[r219i0n0] ~:


Note that in the above I am trying to monitor the native events "cuda:::Tesla_M2090:domain_d:inst_executed" and "ix86arch::INSTRUCTION_RETIRED" concurrently. If I perform a run that uses only one or the other counter, it works fine. And if I run with, say, PAPI_TOT_CYC and PAPI_TOT_INS, it also works fine. The limitation appears only when one tries to combine a CPU and a GPU event at the same time: PAPI_add_event() returns -1 ("Invalid argument").

Do you know whether this is, in fact, an actual restriction of PAPI? It would be unfortunate if that is the case because one of the things we were hoping to do was look at whether the CPU was being "well utilized" during the execution of CUDA kernels.

Thanks again for your help!

-- Bill

Re: Use of CUDA Component

Post by yarkhan » Wed Dec 03, 2014 3:01 pm

I am just addressing your issue with measuring CPU and CUDA events at the same time. For each component that you are measuring, you need to use a different PAPI EventSet. You cannot mix events from different components in the same PAPI EventSet. For example:

int eventSet1 = PAPI_NULL, eventSet2 = PAPI_NULL;  // must be PAPI_NULL before creation
int ret;

PAPI_create_eventset(&eventSet1);
PAPI_add_event(eventSet1, EventCode_for_CPU_event);
// Adding a CPU event automatically binds the EventSet to the CPU component.

PAPI_create_eventset(&eventSet2);
PAPI_add_event(eventSet2, EventCode_for_CUDA_event);
// Adding a CUDA event binds the EventSet to the CUDA component.

ret = PAPI_start(eventSet1);
ret = PAPI_start(eventSet2);

From http://icl.cs.utk.edu/projects/papi/fil ... API-C.html
"PAPI-C extends the concept of an EventSet by binding it to a specific numbered Component. This component index then signals which component the EventSet is paired with. Multiple EventSets can be defined and active simultaneously, but only one EventSet per Component can be enabled."

Asim

Re: Use of CUDA Component

Post by openss-wdh » Thu Dec 04, 2014 2:57 am

Hi Asim,

Many thanks for your help once again! I totally missed the subtlety that an event set can only be associated with a single component. RTFM I guess. More than once. ;) Tonight I restructured our OpenSS CUDA collector code to use multiple event sets as needed and I am now able to periodically sample counts from both CPU and GPU events concurrently. Fantastic!
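
A minimal sketch of that kind of restructuring, not the actual OpenSS collector code: requested event names are split into one group per component, and each group would then back its own PAPI EventSet. The types and functions here (EventGroup, component_of, group_events) are hypothetical, and classifying by the "cuda:::" name prefix is an assumption made only to keep the sketch self-contained; real code would more likely ask PAPI itself, for example via PAPI_get_event_component().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MAX_EVENTS_PER_GROUP = 8 };

typedef enum { COMPONENT_CPU = 0, COMPONENT_CUDA = 1, COMPONENT_COUNT } Component;

/* One group per component; each group stands in for one PAPI EventSet. */
typedef struct {
    const char *events[MAX_EVENTS_PER_GROUP];
    size_t count;
} EventGroup;

/* Assumption for this sketch only: an event belongs to the CUDA
 * component if its name starts with "cuda:::"; everything else is
 * treated as a CPU event. */
Component component_of(const char *event_name)
{
    return strncmp(event_name, "cuda:::", 7) == 0 ? COMPONENT_CUDA
                                                  : COMPONENT_CPU;
}

/* Distributes the n event names across the per-component groups.
 * Returns 0 on success, or -1 if a group would overflow. */
int group_events(const char *const *names, size_t n,
                 EventGroup groups[COMPONENT_COUNT])
{
    assert(names != NULL || n == 0);

    for (size_t c = 0; c < COMPONENT_COUNT; ++c) {
        groups[c].count = 0;
    }
    for (size_t i = 0; i < n; ++i) {
        EventGroup *g = &groups[component_of(names[i])];
        if (g->count == MAX_EVENTS_PER_GROUP) {
            return -1;
        }
        g->events[g->count++] = names[i];
    }
    return 0;
}
```

With the two events from the failing run above, the CUDA event lands in one group and ix86arch::INSTRUCTION_RETIRED in the other, so neither EventSet ever mixes components.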

I still believe there is an issue with the PAPI CUDA component where, if it is initialized before the application initializes CUDA, the CUDA library can hang. But that issue is no longer blocking us now that I've delayed PAPI initialization until the first CUDA context is created.

-- Bill

