CUDA out of memory during inference with SQL-R1-14B #29

@thatmee

Description

Hello! I tried to run inference with SQL-R1-14B on 1×A100 80GB. No matter whether I set gpu_memory_utilization to 0.9, 0.8, or 0.5, I always get a CUDA out of memory error. SQL-R1-3B and SQL-R1-7B both run successfully on my device, and I can also run other models of around 14B. Do you have any ideas about this error? Thanks :)

Here is the config log of vLLM:

```
Initializing an LLM engine (vdev) with config: model='MPX0222forHF/SQL-R1-14B', speculative_config=None, tokenizer='MPX0222forHF/SQL-R1-14B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=MPX0222forHF/SQL-R1-14B, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
```

Here is the error message:

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 236.69 MiB is free. Process 2457479 has 60.40 GiB memory in use. Including non-PyTorch memory, this process has 18.51 GiB memory in use. Of the allocated memory 18.01 GiB is allocated by PyTorch, and 12.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
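
For context, here is a minimal sketch of the inference call, assuming a standard vLLM offline-inference script with the parameters shown in the config log above; the prompt, sampling settings, and the exact gpu_memory_utilization value are placeholders:

```python
from vllm import LLM, SamplingParams

# Roughly the setup implied by the config log above (assumed, not the exact script).
# gpu_memory_utilization is a placeholder -- 0.9, 0.8, and 0.5 all hit the same OOM.
llm = LLM(
    model="MPX0222forHF/SQL-R1-14B",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    tensor_parallel_size=1,
    enforce_eager=True,
    gpu_memory_utilization=0.9,
)

# Placeholder prompt and sampling settings, just to trigger generation.
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(["placeholder prompt"], sampling_params)
print(outputs[0].outputs[0].text)
```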
