A100 produces oversized checkpoints and slow training

System Info

transformers: 4.36.0
Python: 3.8.10
Driver Version: 535.104.12
CUDA Version: 12.2

I ran an audio classification script on a 4090, an H100, and an A100.
On the 4090 and H100 the code ran fine: the checkpoints were a normal size and training was fast.

When I ran it on 4×A100, however, the checkpoints were far too large and training was very slow. I thought this was because my code has no parallel-processing logic, so I created a container with a single A100 GPU, but it showed the same problems.
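In case it helps with diagnosis, here is a small stdlib-only sketch I can use to list which files dominate a checkpoint directory (for example, whether the size comes from the model weights or from `optimizer.pt`). The `checkpoint-500` path is just a placeholder for one of the oversized checkpoints the Trainer wrote:

```python
import os

def checkpoint_sizes(checkpoint_dir):
    """Return {relative_path: size_in_bytes} for every file under
    checkpoint_dir, sorted largest first, so the dominant files stand out."""
    sizes = {}
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            path = os.path.join(root, name)
            sizes[os.path.relpath(path, checkpoint_dir)] = os.path.getsize(path)
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))

if __name__ == "__main__":
    # "checkpoint-500" is a placeholder; point it at an actual
    # checkpoint directory written during training.
    for f, size in checkpoint_sizes("checkpoint-500").items():
        print(f"{size / 1e6:10.1f} MB  {f}")
```

If the extra size turns out to be optimizer state rather than model weights, that would narrow down where the A100 run differs from the 4090/H100 runs.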

Here are links to my code and data.

Can you help me get this working on the A100?

Data
https://drive.google.com/file/d/1tKNgHiy-b9_oL8hWG4vpDKePqAv8GHNc/view?usp=drive_link
Code
https://drive.google.com/file/d/1zU0UziwtI8SN7PJD35E7-Tg1NSKnYbSr/view?usp=drive_link