
Feat/genai Add GenAI Cost Visibility Plugin with Token- and MIG-Aware Efficiency Metrics (Sidecar Architecture)#69

Open
nXtCyberNet wants to merge 6 commits into opencost:main from nXtCyberNet:feat/genai

Conversation

@nXtCyberNet nXtCyberNet commented Jan 29, 2026

This PR introduces a GenAI Cost Visibility Plugin for OpenCost, implemented as a sidecar using the HashiCorp gRPC plugin system.

The plugin enriches OpenCost allocations with GenAI-specific efficiency signals by joining:

  • existing OpenCost cost allocations (CPU / RAM / GPU), and
  • external GenAI telemetry (tokens, GPU active time, utilization),

without modifying OpenCost’s core allocation logic.

The design is opt-in, stateless, vendor-neutral, and compatible with shared GPUs and MIG-based deployments.


Motivation / Problem

LLM and GenAI workloads consume expensive GPU resources, but are traditionally measured only as $/hour spend.

This prevents platform and FinOps teams from understanding efficiency, as cost is not linked to output.

Before this change, users could not answer:

  • How many tokens/sec are generated per GPU or pod?
  • What is the cost per token for inference vs training?
  • Which model or phase delivers the best efficiency per dollar?
  • Where is GPU spend driven by real demand vs idle reservation?

This PR enables productivity-per-cost visibility for GenAI workloads.


High-Level Design

  • Implemented as a standalone plugin process
  • Communicates with OpenCost over gRPC via Unix Domain Socket
  • Uses OpenCost CustomCostProvider interface
  • All GenAI logic isolated from OpenCost core
  • All efficiency metrics derived at query time

Architecture

Plugin Entry Point (main.go)

  • Plugin bootstrap and gRPC server setup
  • Configuration loading
  • Secure handshake using magic cookies
  • Service registration with HashiCorp go-plugin

Configuration Layer (genai-config.json, config.go)

  • Prometheus connection configuration
  • Metric name customization and overrides
  • Support for vLLM and custom GenAI runtimes
  • Annotation-based overrides with safe fallbacks

Protocol Layer (custom_cost.proto)

  • gRPC contract definition
  • Strongly typed request/response models

Provider / Interface Layer (provider.go, interface.go)

  • Plugin framework integration
  • gRPC client/server wrappers
  • Data transformation between protobuf and internal models

Efficiency Calculation Engine (helpers.go)

Pure, stateless computation layer:

  • Token cost normalization (per 1M tokens)
  • Tokens per GPU-second
  • GPU utilization and waste calculation
  • Efficiency classification (Underutilized / Optimal / Saturated)
  • MIG-aware normalization
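The computations above are pure functions of (cost, tokens, GPU-seconds, utilization). A sketch under assumed names — the 0.30/0.85 classification thresholds are illustrative, not the ones in `helpers.go`:

```go
package main

import "fmt"

// costPerMillionTokens normalizes spend to $ per 1M tokens.
func costPerMillionTokens(costUSD, tokens float64) float64 {
	if tokens == 0 {
		return 0 // no output observed; avoid division by zero
	}
	return costUSD / tokens * 1e6
}

// tokensPerGPUSecond measures throughput per unit of GPU active time.
func tokensPerGPUSecond(tokens, gpuSeconds float64) float64 {
	if gpuSeconds == 0 {
		return 0
	}
	return tokens / gpuSeconds
}

// classify buckets GPU utilization (0..1) into the three efficiency
// classes; thresholds here are illustrative.
func classify(util float64) string {
	switch {
	case util < 0.30:
		return "Underutilized"
	case util > 0.85:
		return "Saturated"
	default:
		return "Optimal"
	}
}

func main() {
	fmt.Println(costPerMillionTokens(12.0, 4000000)) // ≈ $3 per 1M tokens
	fmt.Println(classify(0.5))                       // Optimal
}
```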

Data Integration Layer (join.go)

  • Pod → node → GPU correlation
  • GenAI metadata extraction from annotations
  • MIG capacity mapping
  • Main workload aggregation logic
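The join can be sketched as keying telemetry and allocations on a shared identifier; the types below are simplified stand-ins for the OpenCost allocation and the scraped GenAI samples (the real `join.go` also correlates node names and GPU/MIG UUIDs):

```go
package main

import "fmt"

// Allocation is a simplified OpenCost cost allocation.
type Allocation struct {
	Pod, Node  string
	GPUCostUSD float64
}

// Telemetry is a simplified GenAI sample (e.g. from llm_tokens_emitted_total).
type Telemetry struct {
	Pod    string
	Tokens float64
}

// joinByPod attaches token counts to allocations keyed by pod name and
// returns $ per 1M tokens for each pod that emitted tokens.
func joinByPod(allocs []Allocation, tel []Telemetry) map[string]float64 {
	tokens := make(map[string]float64)
	for _, t := range tel {
		tokens[t.Pod] += t.Tokens
	}
	perMillion := make(map[string]float64)
	for _, a := range allocs {
		if tk := tokens[a.Pod]; tk > 0 {
			perMillion[a.Pod] = a.GPUCostUSD / tk * 1e6
		}
	}
	return perMillion
}

func main() {
	out := joinByPod(
		[]Allocation{{Pod: "vllm-0", Node: "gpu-node-1", GPUCostUSD: 6}},
		[]Telemetry{{Pod: "vllm-0", Tokens: 2000000}},
	)
	fmt.Println(out["vllm-0"]) // ≈ 3 ($ per 1M tokens)
}
```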

MIG Support (mig.go)

  • GPU slice utilization tracking
  • Capacity-aware cost allocation
  • Node-level GPU waste aggregation
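Capacity-aware allocation can be sketched as splitting the parent GPU's cost by each slice's compute fraction, then pricing the idle portion as waste (this is the equal-partition assumption noted under limitations; names here are illustrative, not from `mig.go`):

```go
package main

import "fmt"

// migSlice is a simplified MIG partition: compute capacity as a
// fraction of the parent GPU, plus an observed utilization.
type migSlice struct {
	CapacityFraction float64 // e.g. a 1g profile on a 7-slice GPU ≈ 1/7
	Utilization      float64 // 0..1
}

// sliceCost allocates the parent GPU's hourly cost proportionally to
// compute capacity.
func sliceCost(gpuCostPerHour float64, s migSlice) float64 {
	return gpuCostPerHour * s.CapacityFraction
}

// sliceWaste is the cost of the idle portion of a slice; summing this
// over a node's slices gives node-level GPU waste.
func sliceWaste(gpuCostPerHour float64, s migSlice) float64 {
	return sliceCost(gpuCostPerHour, s) * (1 - s.Utilization)
}

func main() {
	s := migSlice{CapacityFraction: 1.0 / 7, Utilization: 0.4}
	fmt.Printf("cost=%.3f waste=%.3f\n", sliceCost(2.8, s), sliceWaste(2.8, s))
}
```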

Data Sources (prom_source.go, scrape_source.go)

  • Prometheus-based telemetry ingestion
  • Direct metrics scraping fallback
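For the Prometheus path, ingestion presumably issues instant queries against the Prometheus HTTP API; the exact PromQL below is an assumption, shown only to illustrate how a windowed token query could be formed:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL forms a Prometheus HTTP API instant-query URL summing
// token increases per pod over the given window. The PromQL here is
// illustrative, not copied from prom_source.go.
func buildQueryURL(base, window string) string {
	q := fmt.Sprintf(`sum by (pod) (increase(llm_tokens_emitted_total[%s]))`, window)
	v := url.Values{}
	v.Set("query", q)
	return base + "/api/v1/query?" + v.Encode()
}

func main() {
	fmt.Println(buildQueryURL("http://prometheus:9090", "1h"))
}
```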

Implemented GenAI Telemetry

Metrics

  • llm_tokens_emitted_total
  • llm_gpu_seconds_total
  • llm_cpu_seconds_total (optional)

Attributes

  • workflow.phase
  • gen_ai.model.name
  • gen_ai.model.version
  • tenant.id / cost_center
  • accelerator.type
  • gpu.uuid / mig.uuid
  • Kubernetes workload context

Limitations & Known Constraints

Unequal MIG Memory Partitioning

When MIG slices are partitioned with unequal memory, cost attribution may be inaccurate: the current normalization assumes slices are proportional by compute capacity and does not weight by memory share.
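A small worked illustration of the gap, with invented numbers: two slices with equal compute fractions but 10 GiB vs 30 GiB of memory are priced identically today, while the memory-weighted split listed under future work would price them differently.

```go
package main

import "fmt"

// computeProportional prices a slice by compute fraction alone
// (the current behavior).
func computeProportional(gpuCostPerHour, computeFrac float64) float64 {
	return gpuCostPerHour * computeFrac
}

// memoryWeighted is the future-work alternative: price by memory share.
func memoryWeighted(gpuCostPerHour, memGiB, totalMemGiB float64) float64 {
	return gpuCostPerHour * memGiB / totalMemGiB
}

func main() {
	// Two slices with equal compute but 10 GiB vs 30 GiB of memory,
	// on a $2/h parent GPU (all numbers illustrative).
	gpuCost := 2.0
	fmt.Printf("slice0: proportional=$%.2f memory-weighted=$%.2f\n",
		computeProportional(gpuCost, 0.5), memoryWeighted(gpuCost, 10, 40))
	fmt.Printf("slice1: proportional=$%.2f memory-weighted=$%.2f\n",
		computeProportional(gpuCost, 0.5), memoryWeighted(gpuCost, 30, 40))
}
```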

LLM Cache Handling

Cache efficiency is not implemented. Any cache-related logic is currently hard-coded / placeholder-only and not driven by telemetry.


Testing Status

End-to-end validation in a fully configured Kubernetes + OpenCost environment is pending due to local environment constraints.


Why This Approach

  • Keeps GenAI logic isolated and opt-in
  • Avoids changes to OpenCost core
  • Works with shared GPUs and MIG
  • Treats telemetry as best-effort
  • Derives metrics at query time

Scope

In Scope

  • Token-aware cost attribution
  • GPU and MIG efficiency metrics
  • Sidecar plugin architecture

Out of Scope

  • Cache savings attribution
  • Persistent metric storage

Future Work (Non-Blocking)

  • Memory-weighted MIG normalization
  • Standardized OTEL GenAI metrics
  • Cache efficiency once portable telemetry exists

Related to opencost/opencost#3533

Signed-off-by: Rohan Dev <rohantech2005@gmail.com>