Distributed Deep Learning: Performance Characterization and Modeling

Training deep learning models often requires learning from large datasets and using a cluster of servers with expensive hardware accelerators such as GPUs. Our project, referred to as Cornucopia, focuses on improving resource utilization and job response time for distributed deep learning, an emerging datacenter workload. Our work explores the promise of model-specific training speedup, combining both systems and machine learning optimizations. Our current work includes more informed neural architecture search as well as performance characterization and modeling of distributed training.

Acknowledgements: This project is generously supported by NSF Grant #1755659 and Google Cloud Research Credits.

Project Personnel

Tian Guo

Shijian Li

Robert Walls

Lijie Xu

Yiyang Zhao

Guin Gilman

Sam Ogden

Papers

Few-shot Neural Architecture Search.

Yiyang Zhao, Linnan Wang, Yuandong Tian, Rodrigo Fonseca, and Tian Guo.

Thirty-eighth International Conference on Machine Learning (ICML'21)

@InProceedings{fewshot_icml21,
  title     = {Few-shot Neural Architecture Search},
  author    = {Zhao, Yiyang and Wang, Linnan and Tian, Yuandong and Fonseca, Rodrigo and Guo, Tian},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  year      = {2021},
}

Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning.

Shijian Li, Oren Mangoubi, Lijie Xu and Tian Guo.

41st IEEE International Conference on Distributed Computing Systems (ICDCS'21)

@INPROCEEDINGS{li2021syncswitch_icdcs,
  author={Li, Shijian and Mangoubi, Oren and Xu, Lijie and Guo, Tian},
  booktitle={2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)},
  title={Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning},
  year={2021},
}

Demystifying the Placement Policies of the GPU Thread Block Scheduler for Concurrent Kernels.

Guin Gilman, Samuel S. Ogden, Tian Guo and Robert J. Walls.

38th International Symposium on Computer Performance, Modeling, Measurements and Evaluation (Performance'20)

@inproceedings{gpu_perf2020,
  author    = {Guin Gilman and Samuel S. Ogden and Tian Guo and Robert J. Walls},
  title     = {Demystifying the Placement Policies of the GPU Thread Block Scheduler for Concurrent Kernels},
  booktitle = {38th International Symposium on Computer Performance, Modeling, Measurements and Evaluation, {Performance} 2020},
  year      = {2020},
}

Few-shot Neural Architecture Search.

Yiyang Zhao, Linnan Wang, Yuandong Tian, Rodrigo Fonseca, and Tian Guo.

arXiv:2006.06863

@misc{zhao2020fewshot,
  title={Few-shot Neural Architecture Search},
  author={Yiyang Zhao and Linnan Wang and Yuandong Tian and Rodrigo Fonseca and Tian Guo},
  year={2020},
  eprint={2006.06863},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers.

Shijian Li, Robert Walls and Tian Guo.

40th IEEE International Conference on Distributed Computing Systems (ICDCS'20)

@inproceedings{cmdare_icdcs2020,
  author    = {Shijian Li and
               Robert J. Walls and
               Tian Guo},
  title     = {Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers},
  booktitle = {40th {IEEE} International Conference on Distributed Computing Systems, {ICDCS} 2020},
  year      = {2020},
}

Speeding up Deep Learning with Transient Servers.

Shijian Li, Robert Walls, Lijie Xu and Tian Guo.

The 16th IEEE International Conference on Autonomic Computing (ICAC'19)

@inproceedings{transient_training_icac2019,
  author    = {Shijian Li and
               Robert J. Walls and
               Lijie Xu and
               Tian Guo},
  title     = {Speeding up Deep Learning with Transient Servers},
  booktitle = {2019 {IEEE} International Conference on Autonomic Computing, {ICAC}
               2019, Ume{\aa}, Sweden, June 16-20, 2019},
  pages     = {125--135},
  publisher = {{IEEE}},
  year      = {2019},
  url       = {https://doi.org/10.1109/ICAC.2019.00024},
  doi       = {10.1109/ICAC.2019.00024},
  timestamp = {Wed, 16 Oct 2019 14:14:48 +0200},
  biburl    = {https://dblp.org/rec/conf/icac/LiWX019.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}