From af4ea6b7b85d595d11e0846fed08ce0b77c577d1 Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Tue, 6 Jan 2026 20:43:56 -0500 Subject: [PATCH 01/15] Add documentation for Ollama command line tool Added comprehensive documentation for Ollama, a command line tool for running large language models, including installation instructions, environment variables, and usage examples. --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 71 ++++++++++++++++++++++++++++++ 1 file changed, 71 insertions(+) create mode 100644 docs/hpc/08_ml_ai_hpc/07_ollama.md diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md new file mode 100644 index 0000000000..91ac95a835 --- /dev/null +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -0,0 +1,71 @@ +# Ollama - A Command Line LLM Tool +## What is Ollama? +[Ollama](https://github.com/ollama/ollama) is a developing command line tool designed to run large language models. +Ollama Installation Instructions +Create an Ollama directory, such as in your /scratch or /vast directories, then download the ollama files: +``` +curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz +tar -vxzf ollama-linux-amd64.tgz +``` +### Use VAST Storage for Best Performance +There are several environment variables that can be changed: +``` +ollama serve --help +#Environment Variables: +#OLLAMA_HOST The host:port to bind to (default "127.0.0.1:11434") +#OLLAMA_ORIGINS A comma separated list of allowed origins. +#OLLAMA_MODELS The path to the models directory (default is "~/.ollama/models") +#OLLAMA_KEEP_ALIVE The duration that models stay loaded in memory (default is "5m") +``` +LLMs require very fast storage. The fastest storage on the HPC clusters is the currently the all-flash VAST storage service. This storage is designed for AI workloads and can greatly speed up performance. You should change your model download directory accordingly: +``` +export OLLAMA_MODELS=$VAST/ollama_models +``` +You should run this to configure ollama to always use your VAST storage for consistent use: +``` +echo "export OLLAMA_MODELS=$VAST/ollama_models" >> ~/.bashrc file +``` + +## Run Ollama +### Batch Style Jobs +You can run ollama on a random port: +``` +export OLPORT=$(python3 -c "import socket; sock=socket.socket(); sock.bind(('',0)); print(sock.getsockname()[1])") +OLLAMA_HOST=127.0.0.1:$OLPORT ./bin/ollama serve +``` +You can use the above as part of a Slurm batch job like the example below: +``` +#!/bin/bash +#SBATCH --job-name=ollama +#SBATCH --output=ollama_%j.log +#SBATCH --ntasks=1 +#SBATCH --mem=8gb +#SBATCH --gres=gpu:a100:1 +#SBATCH --time=01:00:00 + +export OLPORT=$(python3 -c "import socket; sock=socket.socket(); sock.bind(('',0)); print(sock.getsockname()[1])") +export OLLAMA_HOST=127.0.0.1:$OLPORT + +./bin/ollama serve > ollama-server.log 2>&1 && +wait 10 +./bin/ollama pull mistral +python my_ollama_python_script.py >> my_ollama_output.txt +``` +In the above example, your python script will be able to talk to the ollama server. +### Interactive Ollama Sessions +If you want to run Ollama and chat with it, open a Desktop session on a GPU node via Open Ondemand (https://ood.hpc.nyu.edu/) and launch two terminals, one to start the ollama server and the other to chat with LLMs. 
+**In Terminal 1:** +Start ollama +``` +export OLPORT=$(python3 -c "import socket; sock=socket.socket(); sock.bind(('',0)); print(sock.getsockname()[1])") +echo $OLPORT #so you know what port Ollama is running on +OLLAMA_HOST=127.0.0.1:$OLPORT ./bin/ollama serve +``` +**In Terminal 2:** +Pull a model and begin chatting +``` +export OLLAMA_HOST=127.0.0.1:$OLPORT +./bin/ollama pull llama3.2 +./bin/ollama run llama3.2 +``` +Note that you may have to redefine OLPORT in the second terminal, if you do, make sure you manually set it to the same port as the other terminal window. From 652a3c81c32120ad28b344bd130367438be2cb0d Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Tue, 6 Jan 2026 20:45:30 -0500 Subject: [PATCH 02/15] Add installation instructions for Ollama --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index 91ac95a835..6ad3576798 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -1,7 +1,8 @@ # Ollama - A Command Line LLM Tool ## What is Ollama? [Ollama](https://github.com/ollama/ollama) is a developing command line tool designed to run large language models. -Ollama Installation Instructions + +## Ollama Installation Instructions Create an Ollama directory, such as in your /scratch or /vast directories, then download the ollama files: ``` curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz From d68bf3c9fe43eb362b6576ab1894e26a1e5b357e Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Tue, 6 Jan 2026 20:46:34 -0500 Subject: [PATCH 03/15] Document interactive Ollama sessions setup Added instructions for starting interactive Ollama sessions on a GPU node. --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index 6ad3576798..268cfc543b 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -55,6 +55,7 @@ python my_ollama_python_script.py >> my_ollama_output.txt In the above example, your python script will be able to talk to the ollama server. ### Interactive Ollama Sessions If you want to run Ollama and chat with it, open a Desktop session on a GPU node via Open Ondemand (https://ood.hpc.nyu.edu/) and launch two terminals, one to start the ollama server and the other to chat with LLMs. + **In Terminal 1:** Start ollama ``` From 2928885feeaaac4e645e166c43cd74ca9c038e7f Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Tue, 6 Jan 2026 20:47:01 -0500 Subject: [PATCH 04/15] Update interactive Ollama session instructions Added instructions for starting Ollama server and chatting with LLMs in interactive sessions. --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index 268cfc543b..eed1dbddbd 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -57,6 +57,7 @@ In the above example, your python script will be able to talk to the ollama serv If you want to run Ollama and chat with it, open a Desktop session on a GPU node via Open Ondemand (https://ood.hpc.nyu.edu/) and launch two terminals, one to start the ollama server and the other to chat with LLMs. 
**In Terminal 1:** + Start ollama ``` export OLPORT=$(python3 -c "import socket; sock=socket.socket(); sock.bind(('',0)); print(sock.getsockname()[1])") @@ -64,6 +65,7 @@ echo $OLPORT #so you know what port Ollama is running on OLLAMA_HOST=127.0.0.1:$OLPORT ./bin/ollama serve ``` **In Terminal 2:** + Pull a model and begin chatting ``` export OLLAMA_HOST=127.0.0.1:$OLPORT From e3ac668962450b67d2828f9d065894c77fae7d22 Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Wed, 7 Jan 2026 17:57:04 -0500 Subject: [PATCH 05/15] Revise Ollama installation and storage guidance Updated installation instructions and storage recommendations for Ollama. --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index eed1dbddbd..f0583b6653 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -3,12 +3,12 @@ [Ollama](https://github.com/ollama/ollama) is a developing command line tool designed to run large language models. ## Ollama Installation Instructions -Create an Ollama directory, such as in your /scratch or /vast directories, then download the ollama files: +Create an Ollama directory in your /scratch directories, then download the ollama files: ``` curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz tar -vxzf ollama-linux-amd64.tgz ``` -### Use VAST Storage for Best Performance +### Use High-Performance SCRATCH Storage There are several environment variables that can be changed: ``` ollama serve --help @@ -18,13 +18,13 @@ ollama serve --help #OLLAMA_MODELS The path to the models directory (default is "~/.ollama/models") #OLLAMA_KEEP_ALIVE The duration that models stay loaded in memory (default is "5m") ``` -LLMs require very fast storage. The fastest storage on the HPC clusters is the currently the all-flash VAST storage service. This storage is designed for AI workloads and can greatly speed up performance. You should change your model download directory accordingly: +LLMs require very fast storage. On Torch, the SCRATCH filesystem is an all-flash system designed for AI workloads, providing excellent performance. You should change your model download directory to your scratch space: ``` -export OLLAMA_MODELS=$VAST/ollama_models +export OLLAMA_MODELS=/scratch/$USER/ollama_models ``` -You should run this to configure ollama to always use your VAST storage for consistent use: +You should run this to configure ollama to always use your SCRATCH storage for consistent use: ``` -echo "export OLLAMA_MODELS=$VAST/ollama_models" >> ~/.bashrc file +echo "export OLLAMA_MODELS=/scratch/$USER/ollama_models" >> ~/.bashrc ``` ## Run Ollama From 787122de3f3ae9368fc91c39faa3830d8e3fc133 Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Wed, 7 Jan 2026 18:23:55 -0500 Subject: [PATCH 06/15] Fix typo in Ollama installation instructions --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index f0583b6653..2fd4596fec 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -3,7 +3,7 @@ [Ollama](https://github.com/ollama/ollama) is a developing command line tool designed to run large language models. 
## Ollama Installation Instructions -Create an Ollama directory in your /scratch directories, then download the ollama files: +Create an Ollama directory in your /scratch directory, then download the ollama files: ``` curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz tar -vxzf ollama-linux-amd64.tgz From 9cdd5bb937213a9e2d56bd4fb291f2113f26f374 Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Mon, 26 Jan 2026 17:35:40 -0500 Subject: [PATCH 07/15] Rename Ollama to vLLM in documentation Updated the document to reflect the correct name and description of the tool from 'Ollama' to 'vLLM'. --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index 2fd4596fec..73332eb35a 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -1,7 +1,6 @@ -# Ollama - A Command Line LLM Tool +# vLLM - A Command Line LLM Tool ## What is Ollama? -[Ollama](https://github.com/ollama/ollama) is a developing command line tool designed to run large language models. - +[vLLM](https://docs.vllm.ai/en/latest/) is a fast and easy-to-use library for LLM inference and serving. ## Ollama Installation Instructions Create an Ollama directory in your /scratch directory, then download the ollama files: ``` From 0ceb548c5c2108978fc2ff91d02bec17c7ba0fe8 Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Mon, 26 Jan 2026 17:36:19 -0500 Subject: [PATCH 08/15] Rename section from 'Ollama' to 'vLLM' --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index 73332eb35a..32d656aa50 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -1,5 +1,5 @@ # vLLM - A Command Line LLM Tool -## What is Ollama? +## What is vLLM? [vLLM](https://docs.vllm.ai/en/latest/) is a fast and easy-to-use library for LLM inference and serving. ## Ollama Installation Instructions Create an Ollama directory in your /scratch directory, then download the ollama files: From c003c44c31a6c0de9d19a3cd575458439646d225 Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Tue, 27 Jan 2026 01:44:16 -0500 Subject: [PATCH 09/15] documentation half way through --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 73 +++++++++++++++++++++--------- 1 file changed, 51 insertions(+), 22 deletions(-) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index 32d656aa50..4341f9e397 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -1,33 +1,27 @@ # vLLM - A Command Line LLM Tool ## What is vLLM? [vLLM](https://docs.vllm.ai/en/latest/) is a fast and easy-to-use library for LLM inference and serving. 
-## Ollama Installation Instructions -Create an Ollama directory in your /scratch directory, then download the ollama files: +## vLLM Installation Instructions +Create a vLLM directory in your /scratch directory, then install the vLLM image: ``` -curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz -tar -vxzf ollama-linux-amd64.tgz +apptainer pull docker://vllm/vllm-openai:latest ``` ### Use High-Performance SCRATCH Storage -There are several environment variables that can be changed: +LLMs require very fast storage. On Torch, the SCRATCH filesystem is an all-flash system designed for AI workloads, providing excellent performance.To avoid exceeding your $HOME quota (50GB) and inode limits (30,000 files), you should redirect vLLM's cache and Hugging Face's model downloads to your scratch space: ``` -ollama serve --help -#Environment Variables: -#OLLAMA_HOST The host:port to bind to (default "127.0.0.1:11434") -#OLLAMA_ORIGINS A comma separated list of allowed origins. -#OLLAMA_MODELS The path to the models directory (default is "~/.ollama/models") -#OLLAMA_KEEP_ALIVE The duration that models stay loaded in memory (default is "5m") +export HF_HOME=/scratch/$USER/hf_cache +export VLLM_CACHE_ROOT=/scratch/$USER/vllm_cache ``` -LLMs require very fast storage. On Torch, the SCRATCH filesystem is an all-flash system designed for AI workloads, providing excellent performance. You should change your model download directory to your scratch space: +You should run this to configure vLLM to always use your SCRATCH storage for consistent use: ``` -export OLLAMA_MODELS=/scratch/$USER/ollama_models -``` -You should run this to configure ollama to always use your SCRATCH storage for consistent use: -``` -echo "export OLLAMA_MODELS=/scratch/$USER/ollama_models" >> ~/.bashrc +echo "export HF_HOME=/scratch/\$USER/hf_cache" >> ~/.bashrc +echo "export VLLM_CACHE_ROOT=/scratch/\$USER/vllm_cache" >> ~/.bashrc ``` -## Run Ollama -### Batch Style Jobs +Note: Files on $SCRATCH are not backed up and will be deleted after 60 days of inactivity. Always keep your source code and .slurm scripts in $HOME! + +## Run vLLM +### Online Serving You can run ollama on a random port: ``` export OLPORT=$(python3 -c "import socket; sock=socket.socket(); sock.bind(('',0)); print(sock.getsockname()[1])") @@ -52,8 +46,9 @@ wait 10 python my_ollama_python_script.py >> my_ollama_output.txt ``` In the above example, your python script will be able to talk to the ollama server. -### Interactive Ollama Sessions -If you want to run Ollama and chat with it, open a Desktop session on a GPU node via Open Ondemand (https://ood.hpc.nyu.edu/) and launch two terminals, one to start the ollama server and the other to chat with LLMs. + +### Offline Inference +If you need to process a large dataset at once without setting up a server, you can use vLLM's LLM class. **In Terminal 1:** @@ -71,4 +66,38 @@ export OLLAMA_HOST=127.0.0.1:$OLPORT ./bin/ollama pull llama3.2 ./bin/ollama run llama3.2 ``` -Note that you may have to redefine OLPORT in the second terminal, if you do, make sure you manually set it to the same port as the other terminal window. + + +## vLLM CLI +The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with: +``` +vllm --help +``` +Serve - Starts the vLLM OpenAI Compatible API server. +``` +vllm serve meta-llama/Llama-2-7b-hf +``` +Chat - Generate chat completions via the running API server. 
+``` +# Directly connect to localhost API without arguments +vllm chat + +# Specify API url +vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1 + +# Quick chat with a single prompt +vllm chat --quick "hi" +``` +Complete - Generate text completions based on the given prompt via the running API server. +``` +# Directly connect to localhost API without arguments +vllm complete + +# Specify API url +vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1 + +# Quick complete with a single prompt +vllm complete --quick "The future of AI is" +``` + +For more CLI command references: visit https://docs.vllm.ai/en/stable/cli/. From 3500e7fbc651bd8732e0f212ceb71efb4e6edf28 Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Tue, 27 Jan 2026 16:04:36 -0500 Subject: [PATCH 10/15] Update 07_ollama.md --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 59 ++++++++++++------------------ 1 file changed, 24 insertions(+), 35 deletions(-) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index 4341f9e397..330f9719da 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -21,52 +21,41 @@ echo "export VLLM_CACHE_ROOT=/scratch/\$USER/vllm_cache" >> ~/.bashrc Note: Files on $SCRATCH are not backed up and will be deleted after 60 days of inactivity. Always keep your source code and .slurm scripts in $HOME! ## Run vLLM -### Online Serving -You can run ollama on a random port: +### Online Serving (OpenAI-Compatible API) +vLLM implements the OpenAI API protocol, allowing it to be a drop-in replacement for applications using OpenAI's services. By default, it starts the server at http://localhost:8000. You can specify the address with --host and --port arguments. +**In Terminal 1:** +Start vLLM server (In this example we use Qwen model): ``` -export OLPORT=$(python3 -c "import socket; sock=socket.socket(); sock.bind(('',0)); print(sock.getsockname()[1])") -OLLAMA_HOST=127.0.0.1:$OLPORT ./bin/ollama serve +apptainer exec --nv vllm-openai_latest.sif vllm serve "Qwen/Qwen2.5-0.5B-Instruct" ``` -You can use the above as part of a Slurm batch job like the example below: +When you see: ``` -#!/bin/bash -#SBATCH --job-name=ollama -#SBATCH --output=ollama_%j.log -#SBATCH --ntasks=1 -#SBATCH --mem=8gb -#SBATCH --gres=gpu:a100:1 -#SBATCH --time=01:00:00 - -export OLPORT=$(python3 -c "import socket; sock=socket.socket(); sock.bind(('',0)); print(sock.getsockname()[1])") -export OLLAMA_HOST=127.0.0.1:$OLPORT +Application startup complete. +``` +Open another terminal and log in to the same computing node as in terminal 1. -./bin/ollama serve > ollama-server.log 2>&1 && -wait 10 -./bin/ollama pull mistral -python my_ollama_python_script.py >> my_ollama_output.txt +**In Terminal 2** +``` +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen2.5-0.5B-Instruct", + "messages": [ + {"role": "user", "content": "Your prompt..."} + ] + }' ``` -In the above example, your python script will be able to talk to the ollama server. ### Offline Inference If you need to process a large dataset at once without setting up a server, you can use vLLM's LLM class. 
- -**In Terminal 1:** - -Start ollama -``` -export OLPORT=$(python3 -c "import socket; sock=socket.socket(); sock.bind(('',0)); print(sock.getsockname()[1])") -echo $OLPORT #so you know what port Ollama is running on -OLLAMA_HOST=127.0.0.1:$OLPORT ./bin/ollama serve +For example, the following code downloads the facebook/opt-125m model from HuggingFace and runs it in vLLM using the default configuration. ``` -**In Terminal 2:** +from vllm import LLM -Pull a model and begin chatting +# Initialize the vLLM engine. +llm = LLM(model="facebook/opt-125m") ``` -export OLLAMA_HOST=127.0.0.1:$OLPORT -./bin/ollama pull llama3.2 -./bin/ollama run llama3.2 -``` - +After initializing the LLM instance, use the available APIs to perform model inference. ## vLLM CLI The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with: From 5aa4d94f1cee445d2d1ebebda5311edb2515fa10 Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Wed, 28 Jan 2026 17:00:16 -0500 Subject: [PATCH 11/15] Enhance vLLM documentation with performance metrics Added performance comparison of vLLM and llama-cpp on Torch, including throughput and latency metrics. --- docs/hpc/08_ml_ai_hpc/07_ollama.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index 330f9719da..6a509a43a7 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -1,6 +1,17 @@ # vLLM - A Command Line LLM Tool ## What is vLLM? [vLLM](https://docs.vllm.ai/en/latest/) is a fast and easy-to-use library for LLM inference and serving. + +## Why vLLM? +We tested vLLM and llama-cpp on Torch, and found vLLM performs better on Torch: +Model: Qwen2.5-7B-Instruct +Prompt Tokens:512 +Output Tokens: 256 +|Backend|Peak Throughput|Median Latency(ms)|Recommendation +|-----|-----|-----|-----| +|vLLM|~4689.6|48.0|Best for Batch/Research| +|llama-cpp|~115.0|~280.0|Best for Single User| + ## vLLM Installation Instructions Create a vLLM directory in your /scratch directory, then install the vLLM image: ``` From 58baf5f3ca4bb492eaaa8e86ed44e465a1c6db1c Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Wed, 28 Jan 2026 17:00:36 -0500 Subject: [PATCH 12/15] Fix median latency format for vLLM Updated median latency format for vLLM in the table. 
--- docs/hpc/08_ml_ai_hpc/07_ollama.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_ollama.md index 6a509a43a7..d0c81a3dcd 100644 --- a/docs/hpc/08_ml_ai_hpc/07_ollama.md +++ b/docs/hpc/08_ml_ai_hpc/07_ollama.md @@ -9,7 +9,7 @@ Prompt Tokens:512 Output Tokens: 256 |Backend|Peak Throughput|Median Latency(ms)|Recommendation |-----|-----|-----|-----| -|vLLM|~4689.6|48.0|Best for Batch/Research| +|vLLM|~4689.6|~48.0|Best for Batch/Research| |llama-cpp|~115.0|~280.0|Best for Single User| ## vLLM Installation Instructions From df3fc5c55b75f54cf34584c959a6cea81a0d9f8c Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Fri, 30 Jan 2026 08:45:18 -0500 Subject: [PATCH 13/15] Rename 07_ollama.md to 07_vLLM.md --- docs/hpc/08_ml_ai_hpc/{07_ollama.md => 07_vLLM.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/hpc/08_ml_ai_hpc/{07_ollama.md => 07_vLLM.md} (100%) diff --git a/docs/hpc/08_ml_ai_hpc/07_ollama.md b/docs/hpc/08_ml_ai_hpc/07_vLLM.md similarity index 100% rename from docs/hpc/08_ml_ai_hpc/07_ollama.md rename to docs/hpc/08_ml_ai_hpc/07_vLLM.md From 2583f65fe90548609d5bd6bb61731078547fb8fe Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Wed, 4 Feb 2026 17:04:08 -0500 Subject: [PATCH 14/15] Document SGLang for offline batch inference Added section on SGLang for offline batch inference and linked to documentation. --- docs/hpc/08_ml_ai_hpc/07_vLLM.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/hpc/08_ml_ai_hpc/07_vLLM.md b/docs/hpc/08_ml_ai_hpc/07_vLLM.md index d0c81a3dcd..dd01a3fe2d 100644 --- a/docs/hpc/08_ml_ai_hpc/07_vLLM.md +++ b/docs/hpc/08_ml_ai_hpc/07_vLLM.md @@ -68,6 +68,12 @@ llm = LLM(model="facebook/opt-125m") ``` After initializing the LLM instance, use the available APIs to perform model inference. +### SGLang: A Simple Option for Offline Batch Inference +For cases where users only want to run batch inference and do not need an HTTP endpoint, SGLang provides a much simpler offline engine API compared to running a full vLLM server. It is particularly suitable for dataset processing, evaluation pipelines, and one-off large-scale inference jobs. +For more details and examples, see the official SGLang offline engine documentation: +https://docs.sglang.io/basic_usage/offline_engine_api.html + + ## vLLM CLI The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with: ``` From c0d822d284b3d82bc81e22254a9d3d1518dce969 Mon Sep 17 00:00:00 2001 From: Amanda-dong <159391549+Amanda-dong@users.noreply.github.com> Date: Wed, 4 Feb 2026 19:48:40 -0500 Subject: [PATCH 15/15] Clarify SGLang section title in vLLM documentation Updated section title for clarity and added context about SGLang. --- docs/hpc/08_ml_ai_hpc/07_vLLM.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/hpc/08_ml_ai_hpc/07_vLLM.md b/docs/hpc/08_ml_ai_hpc/07_vLLM.md index dd01a3fe2d..cdc2434da8 100644 --- a/docs/hpc/08_ml_ai_hpc/07_vLLM.md +++ b/docs/hpc/08_ml_ai_hpc/07_vLLM.md @@ -68,7 +68,7 @@ llm = LLM(model="facebook/opt-125m") ``` After initializing the LLM instance, use the available APIs to perform model inference. 
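To make that step concrete, here is a short, self-contained sketch that extends the `facebook/opt-125m` example above using vLLM's documented `SamplingParams` and `LLM.generate` APIs. The prompts and sampling values are illustrative placeholders rather than recommendations, so adjust them for your own workload.
```
from vllm import LLM, SamplingParams

# Placeholder prompts and sampling settings - tune these for your workload.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Initialize the engine once, then run the whole batch in a single call.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

# Each result carries the original prompt and its generated completion(s).
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")
```
If you run this inside the container, you can wrap it with the same `apptainer exec --nv vllm-openai_latest.sif` command used in the online serving example above.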
-### SGLang: A Simple Option for Offline Batch Inference
+### SGLang: A Simple Option for Offline Batch Inference (Supplementary Material)
For cases where users only want to run batch inference and do not need an HTTP endpoint, SGLang provides a much simpler offline engine API compared to running a full vLLM server. It is particularly suitable for dataset processing, evaluation pipelines, and one-off large-scale inference jobs.
For more details and examples, see the official SGLang offline engine documentation:
https://docs.sglang.io/basic_usage/offline_engine_api.html
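To complement the link above, the following is a minimal sketch of what an offline SGLang batch job could look like. It assumes the `sglang` Python package is available and follows the offline engine API described in the linked documentation; the engine class, the sampling parameter names, and the `Qwen/Qwen2.5-0.5B-Instruct` model are assumptions here, so check the docs for the current interface before relying on it.
```
import sglang as sgl

# Assumed offline-engine usage per the SGLang docs: the engine loads the model
# in-process, with no HTTP server involved.
llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B-Instruct")

prompts = [
    "Summarize why fast storage matters for LLM inference.",
    "List three uses of batch inference on an HPC cluster.",
]
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 128}

# generate() processes the whole batch and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output["text"])

llm.shutdown()
```
As with vLLM, point HF_HOME at your SCRATCH space first so the model download does not land in $HOME.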