"Floating point exception" in test cases


Re: "Floating point exception" in test cases

Postby mateo70 » Tue Sep 10, 2013 5:35 am

Hello Jim,
Sorry for the long delay; preparing for teaching took up my time over the last few weeks.
Could you please try the following patch, which should remove the need to set the environment variable? (PLASMA_NUM_THREADS_NUMA will still override the automatic detection through hwloc if required.)
Thank you,

Code:
Index: control/plasmaos-hwloc.c
===================================================================
--- control/plasmaos-hwloc.c   (revision 3607)
+++ control/plasmaos-hwloc.c   (working copy)
@@ -24,6 +24,7 @@
 #include <hwloc.h>
 
 static hwloc_topology_t plasma_topology = NULL; /* Topology object */
+static hwloc_topology_t plasma_full_topology = NULL; /* Topology object of the whole machine */
 static volatile int     plasma_nbr = 0;
 
 void plasma_topology_init(){
@@ -57,6 +58,11 @@
         hwloc_topology_destroy(plasma_topology);
 
         topo_initialized = 0;
+
+        if (plasma_full_topology != NULL) {
+            /* Destroy topology */
+            hwloc_topology_destroy(plasma_full_topology);
+        }
     }
     pthread_mutex_unlock(&mutextopo);
 }
@@ -175,7 +181,21 @@
     hwloc_obj_t      obj;
     int thrdnbr = 1;
 
-    obj = hwloc_get_obj_by_type(plasma_topology, HWLOC_OBJ_NODE, 0);
+    if (plasma_full_topology == NULL) {
+
+        /* Allocate and initialize topology object.  */
+        hwloc_topology_init(&plasma_full_topology);
+
+        /* Set flag for the whole system */
+        hwloc_topology_set_flags(plasma_full_topology, HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
+
+        /* Perform the topology detection.  */
+        hwloc_topology_load(plasma_full_topology);
+
+        /* Get the number of cores (we don't want to use HyperThreading) */
+        sys_corenbr = hwloc_get_nbobjs_by_type(plasma_full_topology, HWLOC_OBJ_CORE);
+    }
+    obj = hwloc_get_obj_by_type(plasma_full_topology, HWLOC_OBJ_NODE, 0);
 
     /* Get a copy of its cpuset that we may modify.  */
     if (obj != NULL) {
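
The override mentioned above (PLASMA_NUM_THREADS_NUMA taking precedence over the hwloc detection) could be sketched roughly as follows. This is only an illustration, not PLASMA's actual code: the helper name `numa_group_size` and its parsing rules are assumptions.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, not part of PLASMA: pick the NUMA group size.
 * A valid positive PLASMA_NUM_THREADS_NUMA wins; otherwise the detected
 * value is used, clamped to at least 1 so it is safe as a divisor. */
int numa_group_size(int detected)
{
    const char *env = getenv("PLASMA_NUM_THREADS_NUMA");

    if (env != NULL && *env != '\0') {
        char *end = NULL;
        long value;

        errno = 0;
        value = strtol(env, &end, 10);

        /* Accept only a fully parsed, positive value that fits in an int. */
        if (errno == 0 && *end == '\0' && value > 0 && value <= INT_MAX)
            return (int)value;
    }

    /* Fall back to the detected size, never returning 0 or less. */
    return (detected > 0) ? detected : 1;
}
```

With the variable unset, `numa_group_size(8)` returns the detected value 8; with it set to "4", the same call returns 4.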
mateo70
 
Posts: 98
Joined: Fri May 07, 2010 3:48 pm

Re: "Floating point exception" in test cases

Postby jimy_b » Tue Sep 10, 2013 11:18 am

Thanks for the patch.
I applied it and recompiled. However, without the environment variable set it still crashes at the same place, because plasma->group_size == 0.

Code:
obj = hwloc_get_obj_by_type(plasma_full_topology, HWLOC_OBJ_NODE, 0);


This line still queries the zeroth node, as before. I will ask the Open MPI people about this.

PS: We have this setup because it is a 7-socket NUMA link machine. To stop the Linux kernel from moving threads that run system processes to far-off NUMA nodes, we isolated the first socket for system-level work only.
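
For readers wondering why a zero group size shows up as "Floating point exception": on Linux/x86, an integer division or modulo by zero raises SIGFPE, and the shell reports that signal with this message. The sketch below assumes the group size is used as a divisor somewhere; the function `thread_group` is hypothetical and only illustrates the guard.

```c
/* Hypothetical illustration, not PLASMA's code: map a thread rank to a
 * NUMA group. Without the guard, group_size == 0 would divide by zero,
 * which is undefined behaviour in C and typically raises SIGFPE. */
int thread_group(int rank, int group_size)
{
    if (group_size <= 0)
        group_size = 1;   /* treat an empty or unknown node as size 1 */

    return rank / group_size;
}
```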
jimy_b
 
Posts: 12
Joined: Wed Jul 31, 2013 1:06 pm

Re: "Floating point exception" in test cases

Postby mateo70 » Tue Sep 10, 2013 11:43 am

OK, I missed the "NUMA node L#0" line in your output. Looking at the connection table at the end, it seemed to me that the node wasn't empty. My bad.
I'll do a different patch then.
Mathieu

Re: "Floating point exception" in test cases

Postby mateo70 » Tue Sep 10, 2013 12:19 pm

OK, I hope it works this time; I cannot check it myself because I don't have access to an architecture like this one. After discussing it with Brice Goglin, this is the simplest solution we could think of.
Mathieu

Code:
Index: plasmaos-hwloc.c
===================================================================
--- plasmaos-hwloc.c   (revision 3609)
+++ plasmaos-hwloc.c   (working copy)
@@ -24,6 +24,7 @@
 #include <hwloc.h>
 
 static hwloc_topology_t plasma_topology = NULL; /* Topology object */
+static int plasma_hwloc_groupsize = -1; /* Size of NUMA nodes */
 static volatile int     plasma_nbr = 0;
 
 void plasma_topology_init(){
@@ -170,21 +171,52 @@
     return PLASMA_SUCCESS;
 }
 
-int plasma_getnuma_size() {
+int plasma_getnuma_size()
+{
+    if ( plasma_hwloc_groupsize == -1 ) {
+        hwloc_topology_t full_topology;
     hwloc_cpuset_t   cpuset;   /* HwLoc cpuset    */
     hwloc_obj_t      obj;
-    int thrdnbr = 1;
+        int nodesnbr, i;
 
-    obj = hwloc_get_obj_by_type(plasma_topology, HWLOC_OBJ_NODE, 0);
+        /* Allocate and initialize topology object.  */
+        hwloc_topology_init(&full_topology);
+
+        /* Set flag for the whole system */
+        hwloc_topology_set_flags(full_topology, HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
+
+        /* Perform the topology detection.  */
+        hwloc_topology_load(full_topology);
+
+        /* Compute number of NUMA nodes */
+        obj = hwloc_get_obj_by_type(full_topology, HWLOC_OBJ_MACHINE, 0);
 
-    /* Get a copy of its cpuset that we may modify.  */
     if (obj != NULL) {
 #if !defined(HWLOC_BITMAP_H)
       cpuset = hwloc_cpuset_dup(obj->cpuset);
 #else
       cpuset = hwloc_bitmap_dup(obj->cpuset);
 #endif
-      thrdnbr = hwloc_get_nbobjs_inside_cpuset_by_type(plasma_topology, cpuset, HWLOC_OBJ_CORE);
+            nodesnbr = hwloc_get_nbobjs_inside_cpuset_by_type( plasma_topology, cpuset, HWLOC_OBJ_NODE );
+        }
+        nodesnbr = (nodesnbr > 0) ? nodesnbr : 1;
+
+        /* Search size of NUMA nodes */
+        for(i=0; i<nodesnbr; i++)
+        {
+            obj = hwloc_get_obj_by_type(full_topology, HWLOC_OBJ_NODE, i);
+
+            if (obj != NULL) {
+#if !defined(HWLOC_BITMAP_H)
+                cpuset = hwloc_cpuset_dup(obj->cpuset);
+#else
+                cpuset = hwloc_bitmap_dup(obj->cpuset);
+#endif
+                plasma_hwloc_groupsize = hwloc_get_nbobjs_inside_cpuset_by_type( plasma_topology, cpuset, HWLOC_OBJ_CORE );
+                if (plasma_hwloc_groupsize > 0)
+                    break;
+            }
+        }
 
       /* Free our cpuset copy */
 #if !defined(HWLOC_BITMAP_H)
@@ -192,9 +224,13 @@
 #else
       hwloc_bitmap_free(cpuset);
 #endif
+
+        hwloc_topology_destroy(full_topology);
+
+        plasma_hwloc_groupsize = (plasma_hwloc_groupsize > 0) ? plasma_hwloc_groupsize : 1;
     }
 
-    return thrdnbr;
+    return plasma_hwloc_groupsize;
 }
 #ifdef __cplusplus
 }

Re: "Floating point exception" in test cases

Postby jimy_b » Tue Sep 10, 2013 12:46 pm

That appears to have nailed it. Thanks a lot, Mateo. Looking forward to benching PLASMA on our system!

Also, 'nodesnbr' needs to be initialized to 1: it is only assigned inside the "if (obj != NULL)" branch, so it is otherwise read uninitialized.
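
The selection logic of the patch (take the first NUMA node that has usable cores, and never return 0) can be sketched without hwloc as below. This is a simplified model under stated assumptions: the per-node core counts are passed in as a plain array, and the function name `first_nonempty_group_size` is made up for the example.

```c
#include <stddef.h>

/* Hypothetical model of the patch logic, not the actual PLASMA code.
 * cores_per_node[i] stands for the number of usable cores found in
 * NUMA node i (0 for a node reserved for the system). */
int first_nonempty_group_size(const int *cores_per_node, size_t nodesnbr)
{
    int groupsize = -1;   /* initialized, so it is never read as garbage */
    size_t i;

    if (cores_per_node != NULL) {
        /* Stop at the first node with at least one usable core. */
        for (i = 0; i < nodesnbr; i++) {
            if (cores_per_node[i] > 0) {
                groupsize = cores_per_node[i];
                break;
            }
        }
    }

    /* Fall back to 1 so the result is always a safe divisor. */
    return (groupsize > 0) ? groupsize : 1;
}
```

On a machine like the one described in this thread, where node 0 is isolated, counts such as {0, 8, 8} would give a group size of 8 instead of 0.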

Re: "Floating point exception" in test cases

Postby mateo70 » Tue Sep 10, 2013 12:56 pm

Thanks. Good catch. I'll check with the others whether they have anything to integrate, but we have now fixed several bug reports, so a patch release should come out soon.
Mathieu

Re: "Floating point exception" in test cases

Postby Megan23 » Mon Sep 16, 2013 4:46 am

Okay, well, let me make sure I understand my own code before I try to fix it. The for loop will only execute if i > 0, right? Then the only place it divides later is c = input % i, so it should never divide by 0?
Megan23
 
Posts: 1
Joined: Mon Sep 16, 2013 4:43 am
