Hi, thanks for the interesting paper and for releasing such a nice codebase.
I’m trying to run the codebase on 8× NVIDIA A6000 (48GB) GPUs, but I’m consistently hitting CUDA out-of-memory (OOM) errors. Do you have recommendations on which hyperparameters are the most effective to tune to reduce GPU memory usage while preserving results as much as possible?
For example:
- number of frames
- number of rollouts
- completion length (generation length)
- per-device batch size
- any other memory-critical settings you recommend adjusting first
If there are known “safe” ranges for smaller GPU budgets, that would be very helpful as well.
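For reference, here is roughly the kind of override I have been experimenting with so far. This assumes the repo exposes TRL-style `GRPOConfig` arguments; the argument names below (and `num_frames` in particular) are my guesses, not necessarily your actual config keys:

```python
# Sketch of the memory-saving overrides I'm trying, assuming a TRL-style GRPOConfig.
# All argument names here are my assumptions about this codebase, not confirmed.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs",
    per_device_train_batch_size=1,   # smallest per-GPU batch
    gradient_accumulation_steps=8,   # keep the effective batch size up
    num_generations=4,               # fewer rollouts per prompt
    max_completion_length=512,       # shorter generations
    gradient_checkpointing=True,     # trade compute for activation memory
    bf16=True,                       # half-precision training
    # num_frames=32,                 # hypothetical name for the video-frame count
)
```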
I also have follow-up questions:
- Are the Table 15 results from separate training runs, each with a different number of frames (e.g., training with 64 frames and then running inference with 64 frames)?
- Is the total batch size computed as `num_gpus * per_device_train_batch_size * steps_per_generation`?
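Just to check that I'm reading it right, here is a quick worked example of my interpretation (the value of `steps_per_generation` is only illustrative, not taken from the paper):

```python
# Worked example of the batch-size formula above for my 8-GPU setup;
# steps_per_generation=4 is an illustrative value I picked, not from the paper.
num_gpus = 8
per_device_train_batch_size = 1
steps_per_generation = 4

total_batch_size = num_gpus * per_device_train_batch_size * steps_per_generation
print(total_batch_size)  # -> 32
```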
Thank you!