George Bosilca

Innovative Computing Laboratory, University of Tennessee

(865) 974-6321 email: bosilca@eecs.utk.edu

 

 

Education and Training:

 

University of Paris XI Orsay, France

Math and Computer Science

B.S. 1999

University of Paris XI Orsay, France

Computer Science

Ph.D. 2003

University of Tennessee, ICL

Parallel Computing

Post Doc 2004-2005

 

Research and Professional Experience:

 

Research Asst. Professor, Innovative Computing Laboratory, University of Tennessee (2007 – present)

 

Adjunct Assistant Professor, University of Tennessee (2004 – present)

 

Research Scientist, Innovative Computing Laboratory, University of Tennessee (2005 - 2007)

 

Sr. Research Assoc., Innovative Computing Laboratory, University of Tennessee (2004 – 2005)

 

Research Assoc., Innovative Computing Laboratory, University of Tennessee (2003 – 2004)

 

 

Synergistic Activities:

 

•      Technical lead, release manager and active member of the Open MPI development team.

•      Active member of the MPI Forum.

•      Technical lead for the AtomS, System Noise and STCI Software Packages; Technical lead for the Fault Tolerant FT-MPI Library Development; and Technical lead for the MPICH-V

•      Architect and Technical Lead for DAGuE / DPLASMA.

 

Collaborators and Co-editors:

 

Emmanuel Agullo (INRIA, France), Brad Benton (IBM), Franck Cappello (INRIA Futur, France), Ralph Castain (LANL), D. Cronk (University of Tennessee), J. Dongarra (University of Tennessee), Terry Dontje (SUN/Oracle), G. Fagg (Microsoft), Patrick Geoffray (Myrinet), Brice Goglin (INRIA, France), Rich Graham (ORNL), Thomas Herault (INRIA Futur, France), Yutaka Ishikawa (University of Tokyo), Emmanuel Jeannot (INRIA, France), Andrew Lumsdaine (Indiana University), Christine Morin (INRIA, France), Yves Robert (ENS, Lyon, France), Jeff Squyres (CISCO), Stan Tomov (University of Tennessee)

 

Graduate and Postdoctoral Advisors and Advisees:

Graduate Students (past 5 years):

 

Daniel Andrzejewski, Thara Angskun, Wesley Bland, Kartheek V. Bodanki, Camille Coti, Jelena Pjesivac-Grbovic, Kusolchu Krerkchai, Narapat Saengpatsa, Gwang Son, Teng Ma, Wei Wu, Anthony Canino, Peter Gaultney, Peng Du

 

Postdoctoral Associates (past 5 years):

 

Stephanie Moreaud, Anthony Danalis, Aurelien Bouteiller, Pierre Lemarinier, Yuan Tang

 

Thesis Advisor:

 

Dr. Franck Cappello, INRIA Futur, University of Paris XI Orsay and INRIA-Illinois Joint Laboratory on PetaScale Computing.

 

Publications:

 

Baboulin, M., Becker, D., Bosilca, G., Danalis, A., Dongarra, J. "An efficient distributed randomized algorithm for solving large dense symmetric indefinite linear systems," Parallel Computing, Costas Bekas, Ananth Grama, Olaf Schenk and Yousef Saad eds. 7th Workshop on Parallel Matrix Algorithms and Applications, Vol 40, Issue 7, 213-223, July, 2014 [bibtex]

Bosilca, G., Bouteiller, A., Herault, T., Robert, Y., Dongarra, J. "Assessing the Impact of ABFT and Checkpoint Composite Strategies," 16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, IEEE, Phoenix, AZ, May, 2014 [pdf] [bibtex]

Lacoste, X., Faverge, M., Ramet, P., Thibault, S., Bosilca, G. "Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes," 23rd International Heterogeneity in Computing Workshop, IPDPS 2014, IEEE, Phoenix, AZ, May, 2014 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Brunet, E., Cappello, F., Dongarra, J., Guermouche, A., Herault, T., Robert, Y., Vivien, F., Zaidouni, D. "Unified Model for Assessing Checkpointing Protocols at Extreme-Scale," Concurrency and Computation: Practice and Experience, John Wiley & Sons, Ltd., November, 2013 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Herault, T., Dongarra, J. "PaRSEC: Exploiting Heterogeneity to Enhance Scalability," IEEE Computing in Science and Engineering, Vol. 15, No. 6, 36-45, November, 2013 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Herault, T., Robert, Y., Dongarra, J. "Assessing the impact of ABFT and Checkpoint composite strategies," University of Tennessee Computer Science Technical Report, ICL-UT-13-03, September, 2013 [pdf] [bibtex]

Bland, W., Du, P., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI," Concurrency and Computation: Practice and Experience, July, 2013 [pdf] [bibtex]

Bland, W., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Post-failure recovery of MPI communication capability: Design and Rationale," International Journal of High Performance Computing Applications, June, 2013 [pdf] [bibtex]

Bland, W., Bouteiller, A., Herault, T., Hursey, J., Bosilca, G., Dongarra, J.J. "An evaluation of User-Level Failure Mitigation support in MPI," Computing, Springer, Vienna, DOI 10.1007/s00607-013-0331-3, 1-14, May, 2013 [pdf] [bibtex]

Ma, T., Bosilca, G., Bouteiller, A., Dongarra, J. "Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms," Journal of Parallel and Distributed Computing, accepted, January, 2013 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Kurzak, J., Luszczek, P., Tomov, S., Dongarra, J. "Scalable Dense Linear Algebra on Heterogeneous Hardware," HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing, IOS Press, 2013 [pdf] [bibtex]

Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Correlated Set Coordination in Fault Tolerant Message Logging Protocols," Concurrency and Computation: Practice and Experience, Vol. 25, No. 4, pp. 572-585, 2013 [pdf] [bibtex]

Agullo, E., Bosilca, G., Castagnède, C., Dongarra, J., Ltaief, H., Tomov, S. "Matrices Over Runtime Systems at Exascale," Supercomputing '12 (poster), Salt Lake City, Utah, November, 2012 [bibtex]

Bland, W., Bouteiller, A., Herault, T., Hursey, J., Bosilca, G., Dongarra, J. "An Evaluation of User-Level Failure Mitigation Support in MPI," Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Springer, Vienna, Austria, September 23 - 26, 2012 [pdf] [bibtex]

Bosilca, G., Dongarra, J., Ltaief, H. "Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems," Third International Conference on Energy-Aware High Performance Computing, Hamburg, Germany, September, 2012 [pdf] [bibtex]

Bland, W., Du, P., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI," 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award), Christos Kaklamanis, Theodore Papatheodorou and Paul Spirakis eds. Springer-Verlag, Rhodes, Greece, August 27-31, 2012 [pdf] [bibtex]

Baboulin, M., Becker, D., Bosilca, G., Danalis, A., Dongarra, J. "An efficient distributed randomized solver with application to large dense linear systems," ICL Technical Report, ICL-UT-12-02, July 11, 2012 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Brunet, E., Cappello, F., Dongarra, J., Guermouche, A., Herault, T., Robert, Y., Vivien, F., Zaidouni, D. "Unified Model for Assessing Checkpointing Protocols at Extreme-Scale," University of Tennessee Computer Science Technical Report (also LAWN 269), UT-CS-12-697, June, 2012 [pdf] [bibtex]

Ma, T., Bosilca, G., Bouteiller, A., Dongarra, J. "HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters," IPDPS 2012 (Best Paper), Shanghai, China, May, 2012 [pdf] [bibtex]

Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Correlated Set Coordination in Fault Tolerant Message Logging Protocols," Concurrency and Computation: Practice and Experience (accepted), March, 2012 [bibtex]

Du, P., Bouteiller, A., Bosilca, G., Herault, T., Dongarra, J. "Algorithm-Based Fault Tolerance for Dense Matrix Factorization," Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, J. Ramanujam, P. Sadayappan eds. ACM, New Orleans, LA, USA, 225-234, February 25-29, 2012 [pdf] [bibtex]

Bland, W., Bosilca, G., Bouteiller, A., Herault, T., Dongarra, J. "A Proposal for User-Level Failure Mitigation in the MPI-3 Standard," University of Tennessee Electrical Engineering and Computer Science Technical Report, ut-cs-12-693, February 24, 2012 [pdf] [bibtex]

Danalis, A., Bouteiller, A., Bosilca, G., Dongarra, J., Herault, T. "From Serial Loops to Parallel Execution on Distributed Systems," PPoPP 2012 (submitted), New Orleans, LA, February, 2012 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Luszczek, P., Dongarra, J. "Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach," Scalable Computing and Communications: Theory and Practice, Khan, S., Wang, L., Zomaya, A. eds. John Wiley & Sons, 699-735, March, 2013 [bibtex]

Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., Dongarra, J. "DAGuE: A generic distributed DAG Engine for High Performance Computing," Parallel Computing, T. Hoefler ed. Elsevier, Vol. 38, No 1-2, 27-51, 2012 [pdf] [bibtex]

Bland, W., Du, P., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI," University of Tennessee Computer Science Technical Report, ut-cs-12-702, 2012 [pdf] [bibtex]

Bosilca, G., Herault, T., Rezmerita, A., Dongarra, J. "On Scalability for MPI Runtime Systems," International Conference on Cluster Computing (CLUSTER), IEEE, Austin, TX, USA, 187-195, September 26-30, 2011 [pdf] [bibtex]

Bosilca, G., Herault, T., Lemarinier, P., Rezmerita, A., Dongarra, J. "Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure," Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011, Yiannis Cotronis, Anthony Danalis, Dimitrios S. Nikolopoulos, Jack Dongarra eds. Springer, Santorini, Greece, LNCS 6960, 342-344, September 18-21, 2011 [pdf] [bibtex]

Ma, T., Bouteiller, A., Bosilca, G., Dongarra, J. "Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW," 18th EuroMPI, Cotronis, Y., Danalis, A., Nikolopoulos, D., Dongarra, J. eds. Springer, Santorini, Greece, pp. 247-254, September, 2011 [bibtex]

Chaarawi, M., Gabriel, E., Keller, R., Graham, R., Bosilca, G., Dongarra, J. "OMPIO: A Modular Software Architecture for MPI I/O," 18th EuroMPI, Cotronis, Y., Danalis, A., Nikolopoulos, D., Dongarra, J. eds. Springer, Santorini, Greece, pp. 81-89, September, 2011 [bibtex]

Ma, T., Bosilca, G., Bouteiller, A., Goglin, B., Squyres, J., Dongarra, J. "Kernel Assisted Collective Intra-node MPI Communication Among Multi-core and Many-core CPUs," Int'l Conference on Parallel Processing (ICPP '11), Taipei, Taiwan, September, 2011 [bibtex]

Bosilca, G., Herault, T., Rezmerita, A., Dongarra, J. "On Scalability for MPI Runtime Systems," Proceedings of the 2011 IEEE International Conference on Cluster Computing, IEEE Computer Society, Austin, TX, 187 - 195, September, 2011 [pdf] [bibtex]

Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Correlated Set Coordination in Fault Tolerant Message Logging Protocols," Proceedings of 17th International Conference, Euro-Par 2011, Part II, Emmanuel Jeannot, Raymond Namyst, Jean Roman eds. Springer, Bordeaux, France, LNCS Vol. 6853, 51-64, August 29 - September 2, 2011 [pdf] [bibtex]

Du, P., Bouteiller, A., Bosilca, G., Herault, T., Dongarra, J. "Algorithm-based Fault Tolerance for Dense Matrix Factorizations," University of Tennessee Computer Science Technical Report, Knoxville, TN, UT-CS-11-676, August 05, 2011 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Saengpatsa, N., Tomov, S., Dongarra, J. "Performance Portability of a GPU Enabled Factorization with the DAGuE Framework," IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC), June 24, 2011 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Saengpatsa, N., Tomov, S., Dongarra, J. "A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems," IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 16-20, 2011 [bibtex]

Bosilca, G., Herault, T., Rezmerita, A., Dongarra, J. "On Scalability for MPI Runtime Systems," University of Tennessee Computer Science Technical Report, Knoxville, TN, ICL-UT-11-05, May 1, 2011 [pdf] [bibtex]

Ma, T., Herault, T., Bosilca, G., Dongarra, J. "Process Distance-aware Adaptive MPI Collective Communications," IEEE Int'l Conference on Cluster Computing (Cluster 2011), Austin, Texas, September, 2011 [bibtex]

Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., Dongarra, J. "DAGuE: A Generic Distributed DAG Engine for High Performance Computing," Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), IEEE, Anchorage, Alaska, USA, 1151-1158, 16-20 May, 2011 [bibtex]

Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar, A., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief, H., Luszczek, P., YarKhan, A., Dongarra, J. "Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA," Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), IEEE, Anchorage, Alaska, USA, 1432-1441, 16-20 May, 2011 [pdf] [bibtex]

Ma, T., Bosilca, G., Bouteiller, A., Goglin, B., Squyres, J., Dongarra, J. "Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs," University of Tennessee Computer Science Technical Report, UT-CS-10-663, November, 2010 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar, A., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief, H., Luszczek, P., YarKhan, A., Dongarra, J. "Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA," University of Tennessee Computer Science Technical Report, UT-CS-10-660, Sept. 15, 2010 [pdf] [bibtex]

Ma, T., Bouteiller, A., Bosilca, G., Dongarra, J. "Locality and Topology aware Intra-node Communication Among Multicore CPUs," Proceedings of the 17th EuroMPI conference, LNCS, Stuttgart, Germany, September, 2010 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Dongarra, J. "Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols," Proceedings of EuroMPI 2010, Jack Dongarra, Michael Resch, Rainer Keller, Edgar Gabriel eds. Springer, Stuttgart, Germany, September, 2010 [pdf] [bibtex]

Bouteiller, A., Bosilca, G., Dongarra, J. "Redesigning the Message Logging Model for High Performance," Concurrency and Computation: Practice and Experience (online version), June 27, 2010 [pdf] [bibtex]

Turchenko, V., Grandinetti, L., Bosilca, G., Dongarra, J. "Improvement of parallelization efficiency of batch pattern BP training algorithm using Open MPI," Proceedings of International Conference on Computational Science, ICCS 2010 (to appear), Elsevier, Amsterdam The Netherlands, June, 2010 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., Dongarra, J. "DAGuE: A generic distributed DAG engine for high performance computing," Innovative Computing Laboratory Technical Report, ICL-UT-10-01, April 11, 2010 [pdf] [bibtex]

Angskun, T., Fagg, G., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J. "Self-Healing Network for Scalable Fault-Tolerant Runtime Environments," Future Generation Computer Systems, Vol. 26, Number 3, pp. 479-485, March, 2010 [pdf] [bibtex]

Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar, A., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief, H., Luszczek, P., YarKhan, A., Dongarra, J. "Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project," Innovative Computing Laboratory Technical Report, ICL-UT-10-02, 2010 [pdf] [bibtex]

Bosilca, G., Coti, C., Herault, T., Lemarinier, P., Dongarra, J. "Constructing Resiliant Communication Infrastructure for Runtime Environments," in Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale, Chapman, B., Desprez, F., Joubert, G., Lichnewsky, A., Peters, F., Priol, T. eds. Volume 19, pp. 441-451, 2010 [bibtex]

Lemarinier, P., Bosilca, G., Coti, C., Herault, T., Dongarra, J. "Constructing Resilient Communication Infrastructure for Runtime Environments," ParCo 2009, Lyon France, September, 2009 [bibtex]

Bosilca, G., Coti, C., Herault, T., Lemarinier, P., Dongarra, J. "Constructing resiliant communication infrastructure for runtime environments," Innovative Computing Laboratory Technical Report, ICL-UT-09-02, July 31, 2009 [pdf] [bibtex]

Dongarra, J., Bosilca, G., Delmas, R., Langou, J. "Algorithmic Based Fault Tolerance Applied to High Performance Computing," Journal of Parallel and Distributed Computing, Volume 69, pp. 410-416, 2009 [pdf] [bibtex]

Bosilca, G., Delmas, R., Dongarra, J., Langou, J. "Algorithmic Based Fault Tolerance Applied to High Performance Computing," University of Tennessee Computer Science Technical Report, UT-CS-08-620 (also LAPACK Working Note 205), June 19, 2008 [pdf] [bibtex]

Bouteiller, A., Bosilca, G., Dongarra, J. "Redesigning the Message Logging Model for High Performance," International Supercomputer Conference (ISC 2008), Dresden, Germany, June 17, 2008 [pdf] [bibtex]

Angskun, T., Bosilca, G., Vander Zanden, B., Dongarra, J. "Optimal Routing in Binomial Graph Networks," The International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), IEEE Computer Society, Adelaide, Australia, December 3-6, 2007 [bibtex]

Angskun, T., Bosilca, G., Dongarra, J. "Self-Healing in Binomial Graph Networks," 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007), Vilamoura, Algarve, Portugal, November, 2007 [pdf] [bibtex]

Bouteiller, A., Bosilca, G., Dongarra, J. "Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging," Accepted for Euro PVM/MPI 2007, Springer, September, 2007 [bibtex]

Graham, R., Brightwell, R., Barrett, B., Bosilca, G., Pjesivac-Grbovic, J. "An Evaluation of Open MPI's Matching Transport Layer on the Cray XT," EuroPVM/MPI 2007, September, 2007 [bibtex]

Angskun, T., Bosilca, G., Dongarra, J. "Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology," Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07), Springer, Niagara Falls, Canada, August 29-30, 2007 [pdf] [bibtex]

Pjesivac-Grbovic, J., Bosilca, G., Fagg, G., Angskun, T., Dongarra, J. "Decision Trees and MPI Collective Algorithm Selection Problem," Euro-Par 2007, Springer, Rennes, France, 105-115, August, 2007 [pdf] [bibtex]

Pjesivac-Grbovic, J., Angskun, T., Bosilca, G., Fagg, G., Gabriel, E., Dongarra, J. "Performance Analysis of MPI Collective Operations," Cluster computing, Springer Netherlands, Volume 10, Number 2, 127-143, June, 2007 [pdf] [bibtex]

Angskun, T., Bosilca, G., Fagg, G., Pjesivac-Grbovic, J., Dongarra, J. "Reliability Analysis of Self-Healing Network using Discrete-Event Simulation," Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07), IEEE Computer Society, 437-444, May, 2007 [bibtex]

Graham, R., Bosilca, G., Pjesivac-Grbovic, J. "A Comparison of Application Performance Using Open MPI and Cray MPI," Cray User Group, CUG 2007, May, 2007 [pdf] [bibtex]

Langou, J., Chen, Z., Bosilca, G., Dongarra, J. "Recovery Patterns for Iterative Methods in a Parallel Unstable Environment," SIAM SISC (to appear), May, 2007 [pdf] [bibtex]

Buttari, A., Luszczek, P., Kurzak, J., Dongarra, J., Bosilca, G. "SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3," University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-595, April 17, 2007 [pdf] [bibtex]

Dongarra, J., Chen, Z., Bosilca, G., Langou, J. "Disaster Survival Guide in Petascale Computing: An Algorithmic Approach," in Petascale Computing: Algorithms and Applications (to appear), Chapman & Hall - CRC Press, 2007 [pdf] [bibtex]

Pjesivac-Grbovic, J., Bosilca, G., Fagg, G., Angskun, T., Dongarra, J. "MPI Collective Algorithm Selection and Quadtree Encoding," Parallel Computing (Special Edition: EuroPVM/MPI 2006), Elsevier, 2007 [pdf] [bibtex]

Pjesivac-Grbovic, J., Fagg, G., Angskun, T., Bosilca, G., Dongarra, J. "MPI Collective Algorithm Selection and Quadtree Encoding," Lecture Notes in Computer Science, Springer Berlin / Heidelberg, ICL-UT-06-13, Vol. 4192, Number 2006, pp. 40-48, September, 2006 [pdf] [bibtex]

Fagg, G., Pjesivac-Grbovic, J., Bosilca, G., Angskun, T., Dongarra, J. "Flexible collective communication tuning architecture applied to Open MPI," 2006 Euro PVM/MPI (submitted), Bonn, Germany, September, 2006 [pdf] [bibtex]

Angskun, T., Fagg, G., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J. "Self-Healing Network for Scalable Fault Tolerant Runtime Environments," DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems, Innsbruck, Austria, September 21-23, 2006 [pdf] [bibtex]

Bosilca, G., Chen, Z., Dongarra, J., Eijkhout, V., Fagg, G., Fuentes, E., Langou, J., Luszczek, P., Pjesivac-Grbovic, J., Seymour, K., You, H., Vadhiyar, S. "Self Adapting Numerical Software SANS Effort," IBM Journal of Research and Development, Volume 50, number 2/3, pp. 223-238, 2006 [pdf] [bibtex]

Pjesivac-Grbovic, J., Fagg, G., Angskun, T., Bosilca, G., Dongarra, J. "MPI Collective Algorithm Selection and Quadtree Encoding," ICL Technical Report, ICL-UT-06-11, 2006 [pdf] [bibtex]

Angskun, T., Fagg, G., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J. "Scalable Fault Tolerant Protocol for Parallel Runtime Environments," 2006 Euro PVM/MPI, Bonn, Germany, ICL-UT-06-12, 2006 [pdf] [bibtex]

Fagg, G., Angskun, T., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J. "Scalable Fault Tolerant MPI: Extending the Recovery Algorithm," Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI, Di Martino, B. et al. eds. Springer-Verlag Berlin, Sorrento (Naples), Italy, LNCS 3666, pp. 67, September 18-21, 2005 [pdf] [bibtex]

Bosilca, G., Dongarra, J., Fagg, G., Langou, J. "Hash Functions for Datatype Signatures in MPI," Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI, Di Martino, B. et al. eds. Springer-Verlag Berlin, Sorrento (Naples), Italy, LNCS 3666, pp. 76-83, September 18-21, 2005 [pdf] [bibtex]

Pjesivac-Grbovic, J., Angskun, T., Bosilca, G., Fagg, G., Gabriel, E., Dongarra, J. "Performance Analysis of MPI Collective Operations," 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05), Denver, Colorado, April 4-8, 2005 [pdf] [bibtex]

Chen, Z., Fagg, G., Gabriel, E., Langou, J., Angskun, T., Bosilca, G., Dongarra, J. "Fault Tolerant High Performance Computing by a Coding Approach," Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear), Chicago, Illinois, June 15-17, 2005 [pdf] [bibtex]

Pjesivac-Grbovic, J., Angskun, T., Bosilca, G., Fagg, G., Gabriel, E., Dongarra, J. "Performance Analysis of MPI Collective Operations," Cluster Computing Journal (to appear), 2006 [pdf] [bibtex]

Bosilca, G., Chen, Z., Dongarra, J., Langou, J. "Recovery Patterns for Iterative Methods in a Parallel Unstable Environment," University of Tennessee Computer Science Department Technical Report, UT-CS-04-538, 2005 [pdf] [bibtex]

Fagg, G., Gabriel, E., Bosilca, G., Angskun, T., Chen, Z., Pjesivac-Grbovic, J., London, K., Dongarra, J. "Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems," Proceedings of ISC2004 (to appear), Heidelberg, Germany, June 23, 2004 [pdf] [bibtex]

Fagg, G., Gabriel, E., Chen, Z., Angskun, T., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J. "Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing," International Journal for High Performance Applications and Supercomputing (to appear), April, 2004 [pdf] [bibtex]

Bosilca, G., Chen, Z., Dongarra, J., Langou, J. "Recovery Patterns for Iterative Methods in a Parallel Unstable Environment," ICL Technical Report, ICL-UT-04-04, 2004 [pdf] [bibtex]

Fagg, G., Gabriel, E., Chen, Z., Angskun, T., Bosilca, G., Bukovsky, A., Dongarra, J. "Fault Tolerant Communication Library and Applications for High Performance Computing," Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented), Santa Fe, NM, October 27-29, 2003 [pdf] [bibtex]