Training deep learning models often requires large datasets and a cluster of servers equipped with expensive hardware accelerators such as GPUs. Our project, Cornucopia, focuses on improving resource utilization and job response time for distributed deep learning, an emerging datacenter workload. We explore the promise of model-specific training speedups that combine systems and machine learning optimizations. Our current work includes more informed neural architecture search, as well as performance characterization and modeling of distributed training.
Acknowledgements: This project is generously supported by NSF Grant #1755659 and Google Cloud Research Credits.
Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers.
Shijian Li, Robert Walls and Tian Guo.
40th IEEE International Conference on Distributed Computing Systems (ICDCS'20)
Speeding up Deep Learning with Transient Servers.
Shijian Li, Robert Walls, Lijie Xu and Tian Guo.
The 16th IEEE International Conference on Autonomic Computing (ICAC'19)