Frontier, the world's largest supercomputer, uses 3,072 AMD GPUs to train an LLM with over a trillion parameters

Bit News: According to a January 13 report by New Zhiyuan, AMD's hardware and software stack can also train GPT-3.5-level large models.

Frontier, the world's largest supercomputer, housed at Oak Ridge National Laboratory, contains 37,888 MI250X GPUs and 9,472 EPYC 7A53 CPUs. Recently, researchers trained a GPT-3.5-scale model using only about 8% of those GPUs. Working on the ROCm software platform, they overcame many of the difficulties of distributed model training on AMD hardware and built a state-of-the-art distributed training framework for large models on that platform.
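One reason such a port is feasible at all is that PyTorch's ROCm builds expose the same distributed API as the CUDA builds: the "nccl" backend name maps to AMD's RCCL library, and torch.cuda calls address HIP devices on MI250X nodes. The sketch below is not the researchers' actual code (their work reportedly built on a far more elaborate tensor/pipeline/data-parallel stack); it is a minimal data-parallel example, assuming a torchrun launcher, meant only to show that the same script runs unchanged on AMD or NVIDIA GPUs.

# Minimal data-parallel training sketch; runs identically on ROCm and CUDA.
# Launch with: torchrun --nproc_per_node=<gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")  # resolves to RCCL on ROCm builds
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)        # a HIP device on AMD hardware

    # Toy stand-in for a transformer layer; the real training used a full
    # large-model stack, not a single linear layer.
    model = torch.nn.Linear(4096, 4096).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced across GPUs via RCCL
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()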

This work provides a viable technical framework for efficiently training LLMs on non-NVIDIA, non-CUDA platforms.

After training, the researchers summarized their experience of training large models on Frontier in a paper detailing the challenges they encountered and overcame.
