George Bosilca

Position

Currently I'm working as a Consultant and Adjunct Assistant Professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville. I can be reached at 865-974-6321 or in person in my office (Claxton 308).

Projects

DAGuE - DAG unified Environment

DAGuE aims at enabling scientific computing on large scale distributed environments featuring many cores, accelerators and high speed networks.

CCI - Common Communication Infrastructure

CCI introduces a novel communication API that both supports many features that have become standard (or otherwise generally expected) in other communication interfaces, and strives to export a small, yet powerful, interface. This new interface draws upon years of experience from network-oriented software development best practices to systems-level implementations. The goal is to create a relatively simple, high-level communication interface with low barriers to adoption while still providing important features such as scalability, resiliency, and performance. The result is the Common Communications Interface (CCI): an intuitive API that is portable, efficient, scalable, and robust to meet the needs of network-intensive applications common in HPC and cloud computing.

Open MPI

Open MPI is a project combining technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI) in order to build the best MPI library available. A completely new MPI-2 compliant implementation, Open MPI offers advantages for system and software vendors, application developers and computer science researchers.

FT-MPI

HARNESS (Heterogeneous Adaptive Reconfigurable Networked SyStem) is an experimental Metacomputing System aiming at providing a highly dynamic, fault-tolerant computing environment for high performance computing applications. To make the HARNESS system more accessible to the user community a HARNESS MPI API has been developed, known as FT-MPI.

MPICH-V

MPICH-V is a research effort with theoretical studies, experimental evaluations and pragmatic implementations aiming to provide an MPI implementation based on MPICH, featuring multiple fault-tolerant protocols. MPICH-V provides an automatic fault-tolerant MPI library (i.e. a totally unchanged application linked with the mpich-v library is a fault-tolerant application).

Bosilca, G., Bouteiller, A., Herault, T., Robert, Y., Dongarra, J. "Assessing the Impact of ABFT and Checkpoint Composite Strategies," IPDPSW, APDCM 2014, Phoenix, AZ, May, 2014 [bibtex]
Lacoste, X., Faverge, M., Ramet, P., Thibault, S., Bosilca, G. "Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes," IPDPSW, HCW 2014, Phoenix, AZ, May, 2014 [bibtex]
Bosilca, G., Bouteiller, A., Brunet, E., Cappello, F., Dongarra, J., Guermouche, A., Herault, T., Robert, Y., Vivien, F., Zaidouni, D. "Unified Model for Assessing Checkpointing Protocols at Extreme-Scale," Concurrency and Computation: Practice and Experience, John Wiley & Sons, Ltd., November, 2013 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Herault, T., Dongarra, J. "PaRSEC: Exploiting Heterogeneity to Enhance Scalability," IEEE Computing in Science and Engineering, Vol. 15, No. 6, 36-45, November, 2013 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Herault, T., Robert, Y., Dongarra, J. "Assessing the impact of ABFT and Checkpoint composite strategies," University of Tennessee Computer Science Technical Report, ICL-UT-13-03, September, 2013 [pdf] [bibtex]
Bland, W., Du, P., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI," Concurrency and Computation: Practice and Experience, July, 2013 [pdf] [bibtex]
Bland, W., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Post-failure recovery of MPI communication capability: Design and Rationale," International Journal of High Performance Computing Applications, June, 2013 [pdf] [bibtex]
Bland, W., Bouteiller, A., Herault, T., Hursey, J., Bosilca, G., Dongarra, J.J. "An evaluation of User-Level Failure Mitigation support in MPI," Computing, Springer, Vienna, DOI 10.1007/s00607-013-0331-3, 1-14, May, 2013 [pdf] [bibtex]
Ma, T., Bosilca, G., Bouteiller, A., Dongarra, J. "Kernel-assisted and topology-aware MPI collective communications on multi-core/many-core platforms," Journal of Parallel and Distributed Computing, accepted, January, 2013 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Kurzak, J., Luszczek, P., Tomov, S., Dongarra, J. "Scalable Dense Linear Algebra on Heterogeneous Hardware," HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing, IOS Press, 2013 [pdf] [bibtex]
Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Correlated Set Coordination in Fault Tolerant Message Logging Protocols," Concurrency and Computation: Practice and Experience, Vol. 25, No. 4, pp. 572-585, 2013 [pdf] [bibtex]
Agullo, E., Bosilca, G., Castagnède, C., Dongarra, J., Ltaief, H., Tomov, S. "Matrices Over Runtime Systems at Exascale," Supercomputing '12 (poster), Salt Lake City, Utah, November, 2012 [bibtex]
Bland, W., Bouteiller, A., Herault, T., Hursey, J., Bosilca, G., Dongarra, J. "An Evaluation of User-Level Failure Mitigation Support in MPI," Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Springer, Vienna, Austria, September 23 - 26, 2012 [pdf] [bibtex]
Bosilca, G., Dongarra, J., Ltaief, H. "Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems," Third International Conference on Energy-Aware High Performance Computing, Hamburg, Germany, September, 2012 [pdf] [bibtex]
Bland, W., Du, P., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI," 18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award), Christos Kaklamanis, Theodore Papatheodorou and Paul Spirakis eds. Springer-Verlag, Rhodes, Greece, August 27-31, 2012 [pdf] [bibtex]
Baboulin, M., Becker, D., Bosilca, G., Danalis, A., Dongarra, J. "An efficient distributed randomized solver with application to large dense linear systems," ICL Technical Report, ICL-UT-12-02, July 11, 2012 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Brunet, E., Cappello, F., Dongarra, J., Guermouche, A., Herault, T., Robert, Y., Vivien, F., Zaidouni, D. "Unified Model for Assessing Checkpointing Protocols at Extreme-Scale," University of Tennessee Computer Science Technical Report (also LAWN 269), UT-CS-12-697, June, 2012 [pdf] [bibtex]
Ma, T., Bosilca, G., Bouteiller, A., Dongarra, J. "HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters," IPDPS 2012 (Best Paper), Shanghai, China, May, 2012 [pdf] [bibtex]
Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Correlated Set Coordination in Fault Tolerant Message Logging Protocols," Concurrency and Computation: Practice and Experience (accepted), March, 2012 [bibtex]
Du, P., Bouteiller, A., Bosilca, G., Herault, T., Dongarra, J. "Algorithm-Based Fault Tolerance for Dense Matrix Factorization," Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, J. Ramanujam, P. Sadayappan eds. ACM, New Orleans, LA, USA, 225-234, February 25-29, 2012 [pdf] [bibtex]
Bland, W., Bosilca, G., Bouteiller, A., Herault, T., Dongarra, J. "A Proposal for User-Level Failure Mitigation in the MPI-3 Standard," University of Tennessee Electrical Engineering and Computer Science Technical Report, ut-cs-12-693, February 24, 2012 [pdf] [bibtex]
Danalis, A., Bouteiller, A., Bosilca, G., Dongarra, J., Herault, T. "From Serial Loops to Parallel Execution on Distributed Systems," PPoPP 2012 (submitted), New Orleans, LA, February, 2012 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Luszczek, P., Dongarra, J. "Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach," Scalable Computing and Communications: Theory and Practice, Khan, S., Wang, L., Zomaya, A. eds. John Wiley & Sons, 699-735, March, 2013 [bibtex]
Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., Dongarra, J. "DAGuE: A generic distributed DAG Engine for High Performance Computing," Parallel Computing, T. Hoefler ed. Elsevier, Vol. 38, No 1-2, 27-51, 2012 [pdf] [bibtex]
Bland, W., Du, P., Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Extending the Scope of the Checkpoint-on-Failure Protocol for Forward Recovery in Standard MPI," University of Tennessee Computer Science Technical Report, ut-cs-12-702, 2012 [pdf] [bibtex]
Bosilca, G., Herault, T., Rezmerita, A., Dongarra, J. "On Scalability for MPI Runtime Systems," International Conference on Cluster Computing (CLUSTER), IEEE, Austin, TX, USA, 187-195, September 26-30, 2011 [pdf] [bibtex]
Bosilca, G., Herault, T., Lemarinier, P., Rezmerita, A., Dongarra, J. "Scalable Runtime for MPI: Efficiently Building the Communication Infrastructure," Proceedings of Recent Advances in the Message Passing Interface - 18th European MPI Users' Group Meeting, EuroMPI 2011, Yiannis Cotronis, Anthony Danalis, Dimitrios S. Nikolopoulos, Jack Dongarra eds. Springer, Santorini, Greece, LNCS 6960, 342-344, September 18-21, 2011 [pdf] [bibtex]
Ma, T., Bouteiller, A., Bosilca, G., Dongarra, J. "Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW," 18th EuroMPI, Cotronis, Y., Danalis, A., Nikolopoulos, D., Dongarra, J. eds. Springer, Santorini, Greece, pp. 247-254, September, 2011 [bibtex]
Chaarawi, M., Gabriel, E., Keller, R., Graham, R., Bosilca, G., Dongarra, J. "OMPIO: A Modular Software Architecture for MPI I/O," 18th EuroMPI, Cotronis, Y., Danalis, A., Nikolopoulos, D., Dongarra, J. eds. Springer, Santorini, Greece, pp. 81-89, September, 2011 [bibtex]
Ma, T., Bosilca, G., Bouteiller, A., Goglin, B., Squyres, J., Dongarra, J. "Kernel Assisted Collective Intra-node MPI Communication Among Multi-core and Many-core CPUs," Int'l Conference on Parallel Processing (ICPP '11), Taipei, Taiwan, September, 2011 [bibtex]
Bosilca, G., Herault, T., Rezmerita, A., Dongarra, J. "On Scalability for MPI Runtime Systems," Proceedings of the 2011 IEEE International Conference on Cluster Computing, IEEE Computer Society, Austin, TX, 187 - 195, September, 2011 [pdf] [bibtex]
Bouteiller, A., Herault, T., Bosilca, G., Dongarra, J. "Correlated Set Coordination in Fault Tolerant Message Logging Protocols," Proceedings of 17th International Conference, Euro-Par 2011, Part II, Emmanuel Jeannot, Raymond Namyst, Jean Roman eds. Springer, Bordeaux, France, LNCS Vol. 6853, 51-64, August 29 - September 2, 2011 [pdf] [bibtex]
Du, P., Bouteiller, A., Bosilca, G., Herault, T., Dongarra, J. "Algorithm-based Fault Tolerance for Dense Matrix Factorizations," University of Tennessee Computer Science Technical Report, Knoxville, TN, UT-CS-11-676, August 05, 2011 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Saengpatsa, N., Tomov, S., Dongarra, J. "Performance Portability of a GPU Enabled Factorization with the DAGuE Framework," IEEE Cluster: workshop on Parallel Programming on Accelerator Clusters (PPAC), June 24, 2011 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Saengpatsa, N., Tomov, S., Dongarra, J. "A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems," IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 16-20, 2011 [bibtex]
Bosilca, G., Herault, T., Rezmerita, A., Dongarra, J. "On Scalability for MPI Runtime Systems," University of Tennessee Computer Science Technical Report, Knoxville, TN, ICL-UT-11-05, May 1, 2011 [pdf] [bibtex]
Ma, T., Herault, T., Bosilca, G., Dongarra, J. "Process Distance-aware Adaptive MPI Collective Communications," IEEE Int'l Conference on Cluster Computing (Cluster 2011), Austin, Texas, September, 2011 [bibtex]
Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., Dongarra, J. "DAGuE: A Generic Distributed DAG Engine for High Performance Computing," Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), IEEE, Anchorage, Alaska, USA, 1151-1158, 16-20 May, 2011 [bibtex]
Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar, A., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief, H., Luszczek, P., YarKhan, A., Dongarra, J. "Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA," Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), IEEE, Anchorage, Alaska, USA, 1432-1441, 16-20 May, 2011 [pdf] [bibtex]
Ma, T., Bosilca, G., Bouteiller, A., Goglin, B., Squyres, J., Dongarra, J. "Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs," University of Tennessee Computer Science Technical Report, UT-CS-10-663, November, 2010 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar, A., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief, H., Luszczek, P., YarKhan, A., Dongarra, J. "Distributed Dense Numerical Linear Algebra Algorithms on Massively Parallel Architectures: DPLASMA," University of Tennessee Computer Science Technical Report, UT-CS-10-660, Sept. 15, 2010 [pdf] [bibtex]
Ma, T., Bouteiller, A., Bosilca, G., Dongarra, J. "Locality and Topology aware Intra-node Communication Among Multicore CPUs," Proceedings of the 17th EuroMPI conference, LNCS, Stuttgart, Germany, September, 2010 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Herault, T., Lemarinier, P., Dongarra, J. "Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols," Proceedings of EuroMPI 2010, Jack Dongarra, Michael Resch, Rainer Keller, Edgar Gabriel eds. Springer, Stuttgart, Germany, September, 2010 [pdf] [bibtex]
Bouteiller, A., Bosilca, G., Dongarra, J. "Redesigning the Message Logging Model for High Performance," Concurrency and Computation: Practice and Experience (online version), June 27, 2010 [pdf] [bibtex]
Turchenko, V., Grandinetti, L., Bosilca, G., Dongarra, J. "Improvement of parallelization efficiency of batch pattern BP training algorithm using Open MPI," Proceedings of International Conference on Computational Science, ICCS 2010 (to appear), Elsevier, Amsterdam, The Netherlands, June, 2010 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Danalis, A., Herault, T., Lemarinier, P., Dongarra, J. "DAGuE: A generic distributed DAG engine for high performance computing," Innovative Computing Laboratory Technical Report, ICL-UT-10-01, April 11, 2010 [pdf] [bibtex]
Angskun, T., Fagg, G., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J. "Self-Healing Network for Scalable Fault-Tolerant Runtime Environments," Future Generation Computer Systems, Vol. 26, Number 3, pp. 479-485, March, 2010 [pdf] [bibtex]
Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Haidar, A., Herault, T., Kurzak, J., Langou, J., Lemarinier, P., Ltaief, H., Luszczek, P., YarKhan, A., Dongarra, J. "Distributed-Memory Task Execution and Dependence Tracking within DAGuE and the DPLASMA Project," Innovative Computing Laboratory Technical Report, ICL-UT-10-02, 2010 [pdf] [bibtex]
Bosilca, G., Coti, C., Herault, T., Lemarinier, P., Dongarra, J. "Constructing Resilient Communication Infrastructure for Runtime Environments," in Advances in Parallel Computing - Parallel Computing: From Multicores and GPU's to Petascale, Chapman, B., Desprez, F., Joubert, G., Lichnewsky, A., Peters, F., Priol, T. eds. Volume 19, pp. 441-451, 2010 [bibtex]
Lemarinier, P., Bosilca, G., Coti, C., Herault, T., Dongarra, J. "Constructing Resilient Communication Infrastructure for Runtime Environments," ParCo 2009, Lyon France, September, 2009 [bibtex]
Bosilca, G., Coti, C., Herault, T., Lemarinier, P., Dongarra, J. "Constructing resilient communication infrastructure for runtime environments," Innovative Computing Laboratory Technical Report, ICL-UT-09-02, July 31, 2009 [pdf] [bibtex]
Dongarra, J., Bosilca, G., Delmas, R., Langou, J. "Algorithmic Based Fault Tolerance Applied to High Performance Computing," Journal of Parallel and Distributed Computing, Volume 69, pp. 410-416, 2009 [pdf] [bibtex]
Bosilca, G., Delmas, R., Dongarra, J., Langou, J. "Algorithmic Based Fault Tolerance Applied to High Performance Computing," University of Tennessee Computer Science Technical Report, UT-CS-08-620 (also LAPACK Working Note 205), June 19, 2008 [pdf] [bibtex]
Bouteiller, A., Bosilca, G., Dongarra, J. "Redesigning the Message Logging Model for High Performance," International Supercomputer Conference (ISC 2008), Dresden, Germany, June 17, 2008 [pdf] [bibtex]
Angskun, T., Bosilca, G., Vander Zanden, B., Dongarra, J. "Optimal Routing in Binomial Graph Networks," The International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), IEEE Computer Society, Adelaide, Australia, December 3-6, 2007 [bibtex]
Angskun, T., Bosilca, G., Dongarra, J. "Self-Healing in Binomial Graph Networks," 2nd International Workshop On Reliability in Decentralized Distributed Systems (RDDS 2007), Vilamoura, Algarve, Portugal, November, 2007 [pdf] [bibtex]
Bouteiller, A., Bosilca, G., Dongarra, J. "Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging," Accepted for Euro PVM/MPI 2007, Springer, September, 2007 [bibtex]
Graham, R., Brightwell, R., Barrett, B., Bosilca, G., Pjesivac-Grbovic, J. "An Evaluation of Open MPI's Matching Transport Layer on the Cray XT," EuroPVM/MPI 2007, September, 2007 [bibtex]
Angskun, T., Bosilca, G., Dongarra, J. "Binomial Graph: A Scalable and Fault-Tolerant Logical Network Topology," Proceedings of The Fifth International Symposium on Parallel and Distributed Processing and Applications (ISPA07), Springer, Niagara Falls, Canada, August 29-30, 2007 [pdf] [bibtex]
Pjesivac-Grbovic, J., Bosilca, G., Fagg, G., Angskun, T., Dongarra, J. "Decision Trees and MPI Collective Algorithm Selection Problem," Euro-Par 2007, Springer, Rennes, France, 105-115, August, 2007 [pdf] [bibtex]
Pjesivac-Grbovic, J., Angskun, T., Bosilca, G., Fagg, G., Gabriel, E., Dongarra, J. "Performance Analysis of MPI Collective Operations," Cluster computing, Springer Netherlands, Volume 10, Number 2, 127-143, June, 2007 [pdf] [bibtex]
Angskun, T., Bosilca, G., Fagg, G., Pjesivac-Grbovic, J., Dongarra, J. "Reliability Analysis of Self-Healing Network using Discrete-Event Simulation," Proceedings of Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07), IEEE Computer Society, 437-444, May, 2007 [bibtex]
Graham, R., Bosilca, G., Pjesivac-Grbovic, J. "A Comparison of Application Performance Using Open MPI and Cray MPI," Cray User Group, CUG 2007, May, 2007 [pdf] [bibtex]
Langou, J., Chen, Z., Bosilca, G., Dongarra, J. "Recovery Patterns for Iterative Methods in a Parallel Unstable Environment," SIAM SISC (to appear), May, 2007 [pdf] [bibtex]
Buttari, A., Luszczek, P., Kurzak, J., Dongarra, J., Bosilca, G. "SCOP3: A Rough Guide to Scientific Computing On the PlayStation 3," University of Tennessee Computer Science Dept. Technical Report, UT-CS-07-595, April 17, 2007 [pdf] [bibtex]
Dongarra, J., Chen, Z., Bosilca, G., Langou, J. "Disaster Survival Guide in Petascale Computing: An Algorithmic Approach," in Petascale Computing: Algorithms and Applications (to appear), Chapman & Hall - CRC Press, 2007 [pdf] [bibtex]
Pjesivac-Grbovic, J., Bosilca, G., Fagg, G., Angskun, T., Dongarra, J. "MPI Collective Algorithm Selection and Quadtree Encoding," Parallel Computing (Special Edition: EuroPVM/MPI 2006), Elsevier, 2007 [pdf] [bibtex]
Pjesivac-Grbovic, J., Fagg, G., Angskun, T., Bosilca, G., Dongarra, J. "MPI Collective Algorithm Selection and Quadtree Encoding," Lecture Notes in Computer Science, Springer Berlin / Heidelberg, ICL-UT-06-13, Vol. 4192, Number 2006, pp. 40-48, September, 2006 [pdf] [bibtex]
Fagg, G., Pjesivac-Grbovic, J., Bosilca, G., Angskun, T., Dongarra, J. "Flexible collective communication tuning architecture applied to Open MPI," 2006 Euro PVM/MPI (submitted), Bonn, Germany, September, 2006 [pdf] [bibtex]
Angskun, T., Fagg, G., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J. "Self-Healing Network for Scalable Fault Tolerant Runtime Environments," DAPSYS 2006, 6th Austrian-Hungarian Workshop on Distributed and Parallel Systems, Innsbruck, Austria, September 21-23, 2006 [pdf] [bibtex]
Bosilca, G., Chen, Z., Dongarra, J., Eijkhout, V., Fagg, G., Fuentes, E., Langou, J., Luszczek, P., Pjesivac-Grbovic, J., Seymour, K., You, H., Vadhiyar, S. "Self Adapting Numerical Software SANS Effort," IBM Journal of Research and Development, Volume 50, number 2/3, pp. 223-238, 2006 [pdf] [bibtex]
Pjesivac-Grbovic, J., Fagg, G., Angskun, T., Bosilca, G., Dongarra, J. "MPI Collective Algorithm Selection and Quadtree Encoding," ICL Technical Report, ICL-UT-06-11, 2006 [pdf] [bibtex]
Angskun, T., Fagg, G., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J. "Scalable Fault Tolerant Protocol for Parallel Runtime Environments," 2006 Euro PVM/MPI, Bonn, Germany, ICL-UT-06-12, 2006 [pdf] [bibtex]
Fagg, G., Angskun, T., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J. "Scalable Fault Tolerant MPI: Extending the Recovery Algorithm," Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI, Di Martino, B. et al. eds. Springer-Verlag Berlin, Sorrento (Naples), Italy, LNCS 3666, pp. 67, September 18-21, 2005 [pdf] [bibtex]
Bosilca, G., Dongarra, J., Fagg, G., Langou, J. "Hash Functions for Datatype Signatures in MPI," Proceedings of 12th European Parallel Virtual Machine and Message Passing Interface Conference - Euro PVM/MPI, Di Martino, B. et al. eds. Springer-Verlag Berlin, Sorrento (Naples), Italy, LNCS 3666, pp. 76-83, September 18-21, 2005 [pdf] [bibtex]
Pjesivac-Grbovic, J., Angskun, T., Bosilca, G., Fagg, G., Gabriel, E., Dongarra, J. "Performance Analysis of MPI Collective Operations," 4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS '05), Denver, Colorado, April 4-8, 2005 [pdf] [bibtex]
Chen, Z., Fagg, G., Gabriel, E., Langou, J., Angskun, T., Bosilca, G., Dongarra, J. "Fault Tolerant High Performance Computing by a Coding Approach," Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (to appear), Chicago, Illinois, June 15-17, 2005 [pdf] [bibtex]
Pjesivac-Grbovic, J., Angskun, T., Bosilca, G., Fagg, G., Gabriel, E., Dongarra, J. "Performance Analysis of MPI Collective Operations," Cluster Computing Journal (to appear), 2006 [pdf] [bibtex]
Bosilca, G., Chen, Z., Dongarra, J., Langou, J. "Recovery Patterns for Iterative Methods in a Parallel Unstable Environment," University of Tennessee Computer Science Department Technical Report, UT-CS-04-538, 2005 [pdf] [bibtex]
Fagg, G., Gabriel, E., Bosilca, G., Angskun, T., Chen, Z., Pjesivac-Grbovic, J., London, K., Dongarra, J. "Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems," Proceedings of ISC2004 (to appear), Heidelberg, Germany, June 23, 2004 [pdf] [bibtex]
Fagg, G., Gabriel, E., Chen, Z., Angskun, T., Bosilca, G., Pjesivac-Grbovic, J., Dongarra, J. "Process Fault-Tolerance: Semantics, Design and Applications for High Performance Computing," International Journal for High Performance Applications and Supercomputing (to appear), April, 2004 [pdf] [bibtex]
Bosilca, G., Chen, Z., Dongarra, J., Langou, J. "Recovery Patterns for Iterative Methods in a Parallel Unstable Environment," ICL Technical Report, ICL-UT-04-04, 2004 [pdf] [bibtex]
Fagg, G., Gabriel, E., Chen, Z., Angskun, T., Bosilca, G., Bukovsky, A., Dongarra, J. "Fault Tolerant Communication Library and Applications for High Performance Computing," Los Alamos Computer Science Institute (LACSI) Symposium 2003 (presented), Santa Fe, NM, October 27-29, 2003 [pdf] [bibtex]