Frontier, the world's largest supercomputer, uses 3,072 AMD GPUs to train an LLM with over a trillion parameters

Bit News: According to a January 13 report by New Zhiyuan, AMD's hardware and software stack can also train GPT-3.5-level large models.

Frontier, the world's largest supercomputer, housed at Oak Ridge National Laboratory, contains 37,888 MI250X GPUs and 9,472 EPYC 7A53 CPUs. Recently, researchers trained a GPT-3.5-scale model using only about 8% of those GPUs. Working on the ROCm software platform, they overcame many of the difficulties of distributed model training on AMD hardware and built a state-of-the-art distributed training framework for large models on that platform.
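One reason such a port is feasible at all is that PyTorch's ROCm builds expose the same distributed API as the CUDA builds: the "nccl" backend name maps to AMD's RCCL library, and torch.cuda calls address HIP devices on MI250X nodes. The sketch below is not the researchers' actual code (their work reportedly built on a far more elaborate tensor/pipeline/data-parallel stack); it is a minimal data-parallel example, assuming a torchrun launcher, meant only to show that the same script runs unchanged on AMD or NVIDIA GPUs.

# Minimal data-parallel training sketch; runs identically on ROCm and CUDA.
# Launch with: torchrun --nproc_per_node=<gpus> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")  # resolves to RCCL on ROCm builds
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)        # a HIP device on AMD hardware

    # Toy stand-in for a transformer layer; the real training used a full
    # large-model stack, not a single linear layer.
    model = torch.nn.Linear(4096, 4096).cuda()
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # gradients are all-reduced across GPUs via RCCL
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()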

This work provides a viable technical framework for efficiently training LLMs on non-NVIDIA, non-CUDA platforms.

After training, the researchers summarized their experience of training large models on Frontier in a paper detailing the challenges they encountered and overcame.
