Distributed Deep Learning: Performance Characterization and Modeling

Training deep learning models often requires learning from large datasets on a cluster of servers equipped with expensive hardware accelerators such as GPUs. Our project, referred to as Cornucopia, focuses on improving resource utilization and job response time for distributed deep learning, an emerging datacenter workload. Our work explores the promise of model-specific training speedups by combining systems and machine learning optimizations. Our current work includes more informed neural architecture search, as well as distributed training performance characterization and modeling.
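
To make the setting concrete, the sketch below shows one common form of distributed deep learning, synchronous data-parallel training with PyTorch's DistributedDataParallel. It is a minimal illustration only: the model, synthetic batches, and launch parameters are placeholders and are not the configurations studied in our papers.

# Minimal data-parallel training sketch (one process per GPU, launched with torchrun).
# Assumes PyTorch with NCCL; model and data below are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(local_rank):
    dist.init_process_group(backend="nccl")          # workers rendezvous via env:// set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda()         # placeholder model
    model = DDP(model, device_ids=[local_rank])      # wraps model; gradients are all-reduced across workers
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(100):                           # placeholder training loop with synthetic batches
        inputs = torch.randn(32, 1024, device="cuda")
        labels = torch.randint(0, 10, (32,), device="cuda")
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        optimizer.zero_grad()
        loss.backward()                               # gradient synchronization happens during backward
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Example launch: torchrun --nnodes=<servers> --nproc_per_node=<gpus_per_server> train_sketch.py
    train(int(os.environ["LOCAL_RANK"]))

In this setup, each GPU holds a full model replica and processes a slice of every batch; per-step throughput then depends on both compute speed and the gradient-synchronization cost across servers, which is the kind of behavior our characterization and modeling work measures.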

Acknowledgements: This project is generously supported by NSF Grant #1755659 and Google Cloud Research Credits.

Project Personnel

Papers

Few-shot Neural Architecture Search.

Yiyang Zhao, Linnan Wang, Yuandong Tian, Rodrigo Fonseca, and Tian Guo.

arXiv:2006.06863


Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers.

Shijian Li, Robert Walls, and Tian Guo.

40th IEEE International Conference on Distributed Computing Systems (ICDCS'20)


Speeding up Deep Learning with Transient Servers.

Shijian Li, Robert Walls, Lijie Xu, and Tian Guo.

16th IEEE International Conference on Autonomic Computing (ICAC'19)

© Tian Guo 2020. Last Updated: Jul 28, 2020