by Stan Tomov » Wed Sep 14, 2011 10:36 pm
Yes, most of the computation is done on the GPU, but the critical path of many of the algorithms runs on the CPU, so the CPU code has to be as fast as possible. Currently, the code is tuned to give the best performance when you use all the cores of a socket on your host. If you would like to use only one core, you will have to re-tune the algorithms to get better performance (e.g., reduce the blocking sizes in the file control/get_nb.cpp).
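To illustrate the kind of change meant by reducing the blocking sizes, here is a rough sketch. It is not the actual contents of control/get_nb.cpp: the function name example_get_nb, the cpu_cores parameter, and every threshold and size in it are made up for illustration. The only idea taken from the reply above is that with fewer CPU cores the CPU part of the critical path is slower, so a smaller block can help.

```cpp
#include <algorithm>

// Hypothetical helper, NOT part of the library: picks a blocking size (nb)
// for a hybrid CPU+GPU factorization of an n x n matrix.
// All thresholds and sizes are illustrative placeholders, not tuned values.
int example_get_nb(int n, int cpu_cores)
{
    // Baseline choice, assuming the CPU work runs on all cores of a socket.
    int nb = (n < 2048) ? 64 : 128;

    // With a single core the CPU work on the critical path takes longer,
    // so halve the block to give the CPU less work per step.
    if (cpu_cores <= 1) {
        nb /= 2;
    }

    // Clamp: never larger than the matrix, never smaller than 1.
    return std::max(1, std::min(nb, n));
}
```

In practice you would find the right values by timing a few candidate sizes on your own machine rather than by a fixed rule like the one above.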