CUDA out of memory during inference with SQL-R1-14B #29

@thatmee

Description

Hello! I tried to run inference with SQL-R1-14B on 1×A100 80GB. No matter whether I set gpu_memory_utilization to 0.9, 0.8, or 0.5, I always get a CUDA out of memory error. SQL-R1-3B and SQL-R1-7B both run successfully on my device, and I can also run other models of around 14B. Do you have any ideas about this error? Thanks :)

Here is the config log of vLLM:

```
Initializing an LLM engine (vdev) with config: model='MPX0222forHF/SQL-R1-14B', speculative_config=None, tokenizer='MPX0222forHF/SQL-R1-14B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=MPX0222forHF/SQL-R1-14B, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
```

Here is the error message:

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 236.69 MiB is free. Process 2457479 has 60.40 GiB memory in use. Including non-PyTorch memory, this process has 18.51 GiB memory in use. Of the allocated memory 18.01 GiB is allocated by PyTorch, and 12.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
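
For context, here is a minimal sketch of the inference call, assuming a standard vLLM offline-inference script with the parameters shown in the config log above; the prompt, sampling settings, and the exact gpu_memory_utilization value are placeholders:

```python
from vllm import LLM, SamplingParams

# Roughly the setup implied by the config log above (assumed, not the exact script).
# gpu_memory_utilization is a placeholder -- 0.9, 0.8, and 0.5 all hit the same OOM.
llm = LLM(
    model="MPX0222forHF/SQL-R1-14B",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=8192,
    tensor_parallel_size=1,
    enforce_eager=True,
    gpu_memory_utilization=0.9,
)

# Placeholder prompt and sampling settings, just to trigger generation.
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate(["placeholder prompt"], sampling_params)
print(outputs[0].outputs[0].text)
```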
