[1] | J. Dongarra and P. e. a. Beckman, “The International Exascale Software Roadmap,” International Journal of High Performance Computer Applications, vol. 25, no. 1, 2011. |
[2] | K. Z. Ibrahim, F. Bodin, and O. P`ene, “Fine-Grained Parallelization Of Lattice Qcd Kernel Routine On GPUs,” J. Parallel Distrib. Comput., vol. 68, October 2008. |
[3] | A. Rahimian, I. Lashuk, S. Veerapaneni, A. Chandramowlishwaran, D. Malhotra, L. Moon, R. Sampath, A. Shringarpure, J. Vetter, R. Vuduc, D. Zorin, and G. Biros, “Petascale Direct Numerical Simulation Of Blood Flow On 200k Cores and Heterogeneous Architectures,” in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010. |
[4] | Y. Zhuo, X-L. Wu, J. P. Haldar, W.-m. Hwu, Z.-p. Liang, and B. P. Sutton, “Accelerating Iterative Field-Compensated MR Image Reconstruction on GPUs,” in Proceedings of the 2010 IEEE international conference on Biomedical imaging: from Nano to Macro, ser. ISBI’10, 2010. |
[5] | N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli, “High Performance Discrete Fourier Transforms On Graphics Processors,” in SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1–12. |
[6] | B. Jang, S. Do, H. Pien, and D. Kaeli, “Architecture-Aware Optimization Targeting Multithreaded Stream Computing,” in Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009. |
[7] | M. Lam, E. Rothberg, and M. E. Wolf, “The Cache Performance and Optimizations Of Blocked Algorithms,” inProceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), Santa Clara, CA, Apr. 1991. |
[8] | D. Callahan, S. Carr, and K. Kennedy, “Improving Register Allocation for Subscripted Variables,” in Proceedings of the SIGPLAN ’90 Conference on Programming Language Design and Implementation, White Plains, NY, Jun. 1990. |
[9] | S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu, “Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, 2008. |
[10] | V. Volkov and J. W. Demmel, “Benchmarking GPUs To Tune Dense Linear Algebra,” in SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008. |
[11] | R. D. E. Petit, F. Bodin, “An Hybrid Data Transfer Optimization for GPU,” in Compilers for Parallel Computers (CPC2007), 2007. |
[12] | B. Salpitikorala, A. Chauhan, and G. Fox, “Optimizing OpenCL Kernels for Iterative Statistical Algorithms on GPUs,” in In Proceedings of the Second International Workshop on GPUs and Scientific Applications (GPUScA), 2011. |
[13] | M. B. G. Murthy, M. Ravishankar and P. Sadayappan, “Optimal Loop Unrolling for GPGPU Programs,” in IEEE International Symposium on Parallel Distributed Processing, 2010. |
[14] | L. Yixun, E. Z. Zhang, and X. Shen, “A Cross-Input Adaptive Framework for GPU Program Optimizations,” in Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing, 2009. |
[15] | J. W. Choi, A. Singh, and R. W. Vuduc, “Model-Driven Autotuning Of Sparse Matrix-Vector Multiply On GPUs,” in PPoPP ’10: Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming. New York, NY, USA: ACM, 2010, pp. 115–126. |
[16] | S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization Of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms,” Parallel Computers, vol. 35, no. 3, pp. 178–194, 2009. |
[17] | A. Nukada and S. Matsuoka, “Autotuning 3-D FFT Library for CUDA GPUs,” in SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. New York, NY, USA: ACM, 2009, pp. 1–10. |
[18] | N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha, “A Memory Model for Scientific Algorithms On Graphics Processors,” in SC ’06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing. New York, NY, USA: ACM, 2006, p. 89. |
[19] | K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick, “Stencil Computation Optimization and Auto-Tuning On State-Of-The-Art Multicore Architectures,” in SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1–12. |
[20] | R. Nath, S. Tomov, and J. Dongarra, “Accelerating GPU Kernels for Dense Linear Algebra,” In Proceedings of 9th International Meeting on High Performance Computing for Computational Science (VECPAR’10), 2010. |
[21] | S. Grauer-Gray and J. Cavazos, “Optimizing and Autotuning Belief Propagation On The GPU,” in Proceedings of the 23rd international conference on Languages and compilers for parallel computing, ser. LCPC’10, 2011, pp. 121–135. |
[22] | M. M. Baskaran, J. Ramanujam, and P. Sadayappan, “Automatic C-To-CUDA Code Generation for Affine Programs.” in Lecture Notes in Computer Science, R. Gupta, Ed., vol. 6011. Springer, 2010, pp. 244–263. |
[23] | S. Lee, S.-J. Min, and R. Eigenmann, “OpenMP To GPGPU: A Compiler Framework for Automatic Translation and Optimization,” in Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2009. |
[24] | E. Petit, F. Bodin, G. Papaure, and F. Dru, “Astex: A Hot Path Based Thread Extractor for Distributed Memory System on a Chip,” in Proceedings of the 2006 ACM/IEEE conference on Supercomputing, 2006. |
[25] | C. -Y. Shei, P. Ratnalikar, and A. Chauhan, “Automating GPU Computing in Matlab,” in Proceedings of the international conference on Supercomputing, ICS11, 2011. |
[26] | R. Allen and K. Kennedy, “Optimizing Compilers for Modern Architectures”, Morgan Kaufmann, 2002. |
[27] | R. Cytron and J. Ferrante, “What’s In A Name? -Or- The Value Of Renaming for Parallelism Detection and Storage Allocation”, in ICPP’87, 1987, pp. 19–27. |
[28] | P. Briggs and K. D. Cooper, “Effective Partial Redundancy Elimination,” in Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation, ser. PLDI ’94, 1994. |
[29] | “CUDA PTX ISA,” http://www.nvidia.com |
[30] | S. Carr and K. Kennedy, “Improving The Ratio Of Memory Operations To Floating-Point Operations In Loops,” ACM Transactions on Programming Languages and Systems, vol. 16, no. 6, pp. 1768–1810, 1994. |
[31] | V. Volkov, “Better Performance At Lower Occupancy,” Supercomputing Tutorial, 2010. |
[32] | CUDA Programming Guide, Version 3.0. NVIDIA, http://www.nvidia.com, 2010 |
[33] | L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, “HPCtoolkit: Tools for Performance Analysis Of Optimized Parallel Programs,” Concurrency and Computation: Practice and Experience, To Appear, 2009. |
[34] | S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci, “A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters,” in Supercomputing, ACM/IEEE 2000 Conference, Nov. 2000. |
[35] | Q. Yi and A. Qasem, “Exploring The Optimization Space Of Dense Linear Algebra Kernels,” in Languages and compilers for parallel computing, LCPC08, Aug. 2008. |
[36] | “Stencilprobe: A Microbenchmark for Stencil Applications.” http://www.cs.berkeley.edu/skamil/ |
[37] | A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, and R. Vuduc, “Optimizing and Tuning The Fast Multipole Method for State-Of-The-Art Multicore Architectures,” in IEEE International Symposium on Parallel Distributed Processing (IPDPS), 2010. |