"Floating point exception" in test cases

Postby jimy_b » Wed Jul 31, 2013 1:15 pm

I downloaded and installed PLASMA without any problems. I ran the installer script like this:
Code:
/setup.py --prefix="$HOME/local" --cc=icc --fc=ifort --blaslib="-L/opt/intel/beta/composer_xe_2013_sp1.0.051/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core"

I had to download LAPACKE from netlib.

After it installed, I tried the examples and the tests located in:
plasma-install/build/plasma/

Running the "plasma_testing" script just says:

---- TESTING GECFI... FAILED(-8) !

And the same for all the different routines.

I tried to run the examples and it just outputs:
"Floating point exception"
Must have botched the install, but I don't know how.

I'm running on Linux, with Intel MKL with Intel Xeons.

Update:
More info. Ran it through gdb and I get:

Code:
Program received signal SIGFPE, Arithmetic exception.
0x00000000004033f0 in PLASMA_Init_Affinity (cores=<optimized out>, coresbind=<optimized out>) at control.c:246
246         while ( ((plasma->world_size)%(plasma->group_size)) != 0 )


This is just after it calls plasma_get_numthreads_numa, which has calls to hwloc commands. I reckon that group_size is getting set to 0, resulting in division by zero. I think that hwloc is not configured properly.
jimy_b
 
Posts: 12
Joined: Wed Jul 31, 2013 1:06 pm

Re: "Floating point exception" in test cases

Postby jimy_b » Tue Aug 06, 2013 2:20 pm

For anyone else who comes across this bug.
I've traced the problem to an hwloc call that tries to find the group size, which I think is the number of threads per NUMA node. It returns 0. I'm not sure whether this is a bug in hwloc.
It might be because, on the machine I'm using, the first NUMA node is currently assigned to the boot cpuset only, so maybe hwloc queries that node and gets zero available cores back. This will require some more testing.

Anyway, maybe there should be a check in the PLASMA_Init routine to catch the case where group_size comes back as 0 for some reason, rather than just crashing with a floating point exception.
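To make the suggested check concrete, here is a minimal C sketch. It is not PLASMA's code: safe_group_size is a hypothetical helper, and the decrementing loop body is an assumption, since the gdb output only shows the loop condition at control.c:246. The idea is simply to reject a zero (or otherwise invalid) group size before it reaches the modulo.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not a PLASMA function): returns a group size that
 * evenly divides world_size and is never zero.
 * Assumption: the original loop shrinks group_size until it divides
 * world_size; the loop body is not shown in the gdb output.
 */
int safe_group_size(int world_size, int group_size)
{
    if (world_size <= 0)
        return -1;                 /* nothing sensible to compute */

    /* Guard: a detected size of 0 (or any out-of-range value) falls back
     * to treating all threads as one group instead of dividing by zero. */
    if (group_size <= 0 || group_size > world_size) {
        fprintf(stderr, "warning: invalid NUMA group size %d, using %d\n",
                group_size, world_size);
        group_size = world_size;
    }

    /* Terminates at the latest when group_size reaches 1. */
    while (world_size % group_size != 0)
        group_size--;

    return group_size;
}
```

With this guard, a call such as safe_group_size(16, 0) prints a warning and returns 16 rather than raising SIGFPE. Falling back to a single group is only one possible policy; returning an error from the init routine would be equally reasonable.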

You can get around it by telling PLASMA the size of the NUMA node with the environment variable PLASMA_NUM_THREADS_NUMA.
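For illustration, this is roughly how a library could read such an override safely. It is a sketch only: positive_int_from_env is a hypothetical helper, and how PLASMA itself parses PLASMA_NUM_THREADS_NUMA is not shown in this thread. The point is that an unset, non-numeric, or non-positive value should fall back to a sane default instead of ending up as a zero divisor.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read a positive integer from the environment
 * variable `name`. Returns `fallback` if the variable is unset, empty,
 * not a plain decimal number, or not positive.
 */
int positive_int_from_env(const char *name, int fallback)
{
    const char *text = getenv(name);
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return fallback;

    errno = 0;
    value = strtol(text, &end, 10);

    /* Reject overflow, trailing garbage, zero, and negative values. */
    if (errno != 0 || *end != '\0' || value <= 0 || value > INT_MAX)
        return fallback;

    return (int)value;
}
```

A caller would use it as positive_int_from_env("PLASMA_NUM_THREADS_NUMA", detected_size), where detected_size stands for whatever the automatic detection returned. On the user side, the variable just needs to be set in the shell before launching the program; the right value depends on how many cores each NUMA node on your machine has.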

Cheers.
jim
jimy_b
 
Posts: 12
Joined: Wed Jul 31, 2013 1:06 pm

Re: "Floating point exception" in test cases

Postby admin » Tue Aug 06, 2013 3:42 pm

Looks like our MPI guys are not aware of this problem.
Can you post to the hwloc mailing list (hwloc-users@open-mpi.org)?
Thanks,
Jakub
admin
Site Admin
 
Posts: 79
Joined: Wed May 13, 2009 1:27 pm

Re: "Floating point exception" in test cases

Postby mateo70 » Wed Aug 07, 2013 6:24 am

Hello Jim,
Could you please send the output of the hwloc-ls or lstopo command on your system?
Thank you
Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: "Floating point exception" in test cases

Postby jimy_b » Thu Aug 08, 2013 12:24 pm

Hi
Thanks for your response.
Here is the output of hwloc-ls as a pdf:
https://www.dropbox.com/s/to5duraqb8kipkl/hwloc-ls.pdf

Cheers.
jimy_b
 
Posts: 12
Joined: Wed Jul 31, 2013 1:06 pm

Re: "Floating point exception" in test cases

Postby mateo70 » Fri Aug 09, 2013 7:11 am

Hello Jimy,

Thank you for the files. I have another request for you. Could you please send us the results of the following three commands:

hwloc-ls --whole-system system.xml
hwloc-ls --whole-system whole-system.xml
hwloc-gather-topology system-topo

That would help us find the correct solution to the problem. Currently we compute the NUMA node size from the first NUMA node, which is empty in your case. So we need to know whether it is always empty, or whether it is populated when we create the topology for the whole system.
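The failure mode described here can be shown with a small C sketch that does not use hwloc at all. Everything in it is hypothetical: the per-node core counts are plain integers supplied by the caller, and skipping empty nodes is just one possible alternative, not the fix chosen for PLASMA.

```c
#include <stddef.h>

/* cores_per_node[i] is the number of cores the process may use on NUMA
 * node i (hypothetical input; in PLASMA this would come from hwloc). */

/* Mirrors the behavior described above: only node 0 is inspected, so an
 * empty first node yields a size of 0. */
int node_size_first(const int *cores_per_node, size_t num_nodes)
{
    return num_nodes > 0 ? cores_per_node[0] : 0;
}

/* Alternative sketch: skip empty nodes. Returns 0 only when every node
 * is empty, which the caller still has to handle. */
int node_size_first_nonempty(const int *cores_per_node, size_t num_nodes)
{
    for (size_t i = 0; i < num_nodes; i++)
        if (cores_per_node[i] > 0)
            return cores_per_node[i];
    return 0;
}
```

On a machine where the first node is reserved and therefore exposes no cores to the job, the first function returns 0 and the second returns the size of the first usable node.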
Thanks,
Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: "Floating point exception" in test cases

Postby jimy_b » Mon Aug 12, 2013 6:07 am

Done.
Attached to this post.

Thanks,
Jim
Attachments
hwloc-output.tar.gz
jimy_b
 
Posts: 12
Joined: Wed Jul 31, 2013 1:06 pm

Re: "Floating point exception" in test cases

Postby jimy_b » Mon Aug 19, 2013 11:29 am

bump
jimy_b
 
Posts: 12
Joined: Wed Jul 31, 2013 1:06 pm

Re: "Floating point exception" in test cases

Postby mateo70 » Mon Aug 19, 2013 11:40 am

Hello Jim,

Thank you for the archive. I was on vacation last week. The problem seems to be solvable with our first idea, which is to ask for a full description of the machine first to detect the NUMA configuration, and then ask for the restricted topology that the user's job can access for thread binding. I will try to prepare a fix quickly, and as soon as I have it I'll send it to you. Until the problem is fixed, the best option is to keep using the environment variable to set the NUMA group size. Even with the fix, that might remain the best solution, since we still don't know the cost of detecting the full architecture of the machine. It should be quick, but it might also add overhead to the PLASMA_Init function.

Regards,
Mathieu
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: "Floating point exception" in test cases

Postby jimy_b » Tue Aug 20, 2013 6:03 am

Thanks a lot!
jimy_b
 
Posts: 12
Joined: Wed Jul 31, 2013 1:06 pm
