Aether-Guard is a distributed infrastructure monitoring and migration system that combines a high-performance C++ agent, a .NET Core control plane, a Python AI engine, and a Next.js dashboard for real-time telemetry and risk-aware recovery.
This README documents the current implementation in this repo (v1.x) and the v2.2 reference architecture with a concrete implementation guide.
- Stage: v2.2 baseline delivered (Phase 0-4). Remaining productization gaps tracked below.
- License: MIT
- Authors: Qi Junyi, Xiao Erdong (2026)
This project targets a product-grade release, not a demo. The following standards are required for delivery.
- Time to first value (TTFV): first telemetry visible in under 15 minutes.
- One-command deploy, one-command fire drill, one-command rollback.
- Self-check and guidance: CLI/scripts validate dependencies, ports, CRIU/eBPF, and permissions with actionable errors.
- Guided first-run: register agent, receive data, trigger drill, observe migration.
- Explainability: AI risk reason, migration decision, and failures visible in the UI.
- Recovery help: diagnostics bundle export with logs, snapshots, and config.
- Docs-as-product: README, Quickstart, Troubleshooting, FAQ, deploy/upgrade/rollback.
- Security and trust chain: auth for Agent/Telemetry/Artifact/Command, mTLS with rotation, audit logs, SBOM/SLSA/signing.
- Reliability and resilience: idempotency, retries with backoff, rate limits, circuit breakers, MQ backpressure and DLQ.
- Observability: OpenTelemetry traces/metrics/logs, consistent trace_id, health and readiness probes.
- Deployment and operations: Helm + Compose, config validation, backup/restore, runbooks.
- Data governance: schema registry and compatibility, retention/cleanup, snapshot lifecycle, migrations.
- Compatibility and evolution: API versioning, capability negotiation, deprecation policy.
- Performance and scale: streaming uploads/downloads, capacity baselines, horizontal scaling strategy.
- Agent (C++): REST/JSON telemetry; CRIU checkpointing with automatic simulation fallback.
- Core API (.NET 8): REST controllers plus gRPC services with JSON transcoding; RabbitMQ ingestion worker with W3C trace context propagation; migration orchestration; PostgreSQL storage.
- Protobuf contracts: shared schemas in src/shared/protos (AgentService + ControlPlane).
- Agent handshake: registration accepts capability payloads and returns AgentConfig to drive feature gating.
- AI Engine (FastAPI): volatility and trend rules; Core currently sends empty spotPriceHistory (see Risk Logic).
- Dashboard (Next.js): telemetry and command visibility with NextAuth credentials.
- Storage: snapshots stored on local filesystem by default; optional S3/MinIO backend with retention sweeper and S3 lifecycle support.
- Security: API key for command endpoints; SPIRE mTLS for agent/core; OpenTelemetry baseline across core/AI/dashboard.
- Supply chain: SBOM generation, cosign signing, and SLSA provenance in CI.
- Diagnostics bundle export available (API + dashboard).
- No end-to-end auth on telemetry or artifacts; command API key only.
- No schema registry or compatibility policy for MQ events.
- Agent-side OpenTelemetry spans are not yet emitted (server-side spans/metrics are wired).
- Core API enables gRPC + JSON transcoding so internal traffic uses Protobuf and external clients keep REST/JSON.
- W3C trace context must propagate across HTTP and RabbitMQ by injecting traceparent and tracestate headers (see the publish sketch after this list).
- Agent performs capability probe (kernel, CRIU, eBPF, feature flags) and reports a Capabilities payload at registration.
- Core responds with AgentConfig to enable or disable features based on compatibility and policy.
- RabbitMQ messages use explicit schemas (Avro/Protobuf) with schema_id + payload, plus upcaster logic for old events.
- Snapshots move to object storage (MinIO/S3) with hot, warm, and cold retention policies.
- Workload identity uses SPIFFE/SPIRE with short-lived SVIDs; mTLS replaces static certs.
- Supply chain uses SLSA provenance, SBOM, and signed images.
- RabbitMQ uses QoS prefetch and explicit ack; failures route to DLQ.
- Idempotency keys are required for critical commands; agents cache recent request_ids to avoid re-execution (see the cache sketch after this list).
- Policy plugins run in Wasmtime with fuel and memory limits.
- Runbook automation triggers scripts and attaches artifacts to alerts.
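The trace-context requirement above has a concrete shape: the publisher copies the current traceparent (and tracestate, if any) into the AMQP message headers so the consumer can continue the same trace. The Core ingestion worker does this in .NET; the Python/pika sketch below only illustrates the header layout, and the exchange, routing key, and traceparent value are placeholders.

```python
import pika

def publish_with_trace_context(channel, body, traceparent, tracestate=""):
    """Publish a telemetry event and carry W3C trace context in the message headers."""
    headers = {"traceparent": traceparent}
    if tracestate:
        headers["tracestate"] = tracestate
    channel.basic_publish(
        exchange="",              # placeholder: default exchange
        routing_key="telemetry",  # placeholder queue name
        body=body,
        properties=pika.BasicProperties(headers=headers, delivery_mode=2),  # persistent
    )

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="telemetry", durable=True)
publish_with_trace_context(
    channel,
    b'{"agentId": "agent-demo-01"}',
    traceparent="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",  # example value
)
connection.close()
```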
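For the idempotency item above, a minimal sketch of the agent-side cache follows. The real agent is C++; this Python version, including the TTL and the request_id values, is only an illustration of the "remember recent request_ids" rule.

```python
import time

class IdempotencyCache:
    """Remember recently executed command request_ids so retried deliveries are ignored."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._seen: dict[str, float] = {}

    def seen_before(self, request_id: str) -> bool:
        now = time.monotonic()
        # Drop expired entries so the cache stays bounded.
        self._seen = {rid: t for rid, t in self._seen.items() if now - t < self.ttl}
        if request_id in self._seen:
            return True
        self._seen[request_id] = now
        return False

cache = IdempotencyCache()
for rid in ["cmd-123", "cmd-123", "cmd-456"]:  # hypothetical command ids
    if cache.seen_before(rid):
        print(f"skipping duplicate command {rid}")
    else:
        print(f"executing command {rid}")
```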
- Add self-check scripts (agent/core dependencies, ports, permissions).
- Add first-run guide in the dashboard.
- Add explainability fields and failure reasons in UI.
- Add diagnostics bundle export.
- Expand docs: Quickstart, Troubleshooting, FAQ, upgrade/rollback.
- Enable gRPC JSON transcoding in Core.
- Define Protobuf contracts for Agent/Core APIs.
- Inject W3C trace context into RabbitMQ headers.
- Add DetectCapabilities() in the Agent boot sequence.
- Extend /register to accept Capabilities and return AgentConfig (see the registration sketch after this list).
- Introduce SPIRE or cert-manager based certificate rotation.
- Deploy MinIO (S3 compatible) for snapshots.
- Update ArtifactController to stream to S3 SDK.
- Add SLSA provenance generation in CI.
- Add OpenTelemetry collector + Jaeger and wire core/AI/dashboard exporters.
- Add snapshot retention automation and S3 lifecycle policy support.
- Generate SBOMs and sign container images with cosign in CI.
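To make the registration/capability step tangible, here is a hedged client-side sketch against the v2 REST surface listed below. The field names (agentId, capabilities, kernelVersion, criuAvailable, ebpfAvailable) are illustrative only; the authoritative contract is the Protobuf schema in src/shared/protos.

```python
import platform
import shutil
import requests

CORE_URL = "http://localhost:5000"  # REST endpoint from the port list below

def detect_capabilities() -> dict:
    """Rough capability probe; the real probe lives in the C++ agent."""
    return {
        "kernelVersion": platform.release(),
        "criuAvailable": shutil.which("criu") is not None,
        "ebpfAvailable": False,  # placeholder; a real probe would inspect the kernel
    }

def register_agent(agent_id: str) -> dict:
    payload = {"agentId": agent_id, "capabilities": detect_capabilities()}
    resp = requests.post(f"{CORE_URL}/api/v2/agent/register", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()  # expected to carry the AgentConfig used for feature gating

if __name__ == "__main__":
    print(register_agent("agent-demo-01"))
```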
- Agent (C++) -> Core API (.NET) -> AI Engine (FastAPI) -> Core API -> PostgreSQL -> Dashboard (Next.js)
- agent-service: C++ telemetry agent with CRIU-based checkpointing (auto-falls back to simulation mode when CRIU is unavailable).
- core-service: ASP.NET Core API for ingestion, analysis, migration orchestration, and data access.
- ai-service: FastAPI service for volatility-based risk scoring.
- web-service: Next.js dashboard with authentication and visualization.
- db: PostgreSQL for persistence.
- rabbitmq: message broker for telemetry ingestion.
- redis: dedup cache for telemetry ingestion.
- minio: S3-compatible object storage for snapshot artifacts (enabled in docker-compose).
- Core API: http://localhost:5000
- Core API (mTLS): https://localhost:5001
- Dashboard: http://localhost:3000
- AI Engine: http://localhost:8000
- PostgreSQL: localhost:5432
- RabbitMQ Management: http://localhost:15672
- MinIO API: http://localhost:9000
- MinIO Console: http://localhost:9001
- OpenTelemetry OTLP gRPC: http://localhost:4317
- OpenTelemetry OTLP HTTP: http://localhost:4318
- Jaeger UI: http://localhost:16686
```
docker compose up --build -d
```
Open the dashboard at http://localhost:3000.
- Quickstart: docs/Quickstart.md
- Troubleshooting: docs/Troubleshooting.md
- FAQ: docs/FAQ.md
- Upgrade and rollback: docs/Upgrade-Rollback.md
- SPIRE mTLS: docs/SPIRE-mTLS.md
- Observability (OpenTelemetry): docs/Observability.md
If you want to simulate migrations, start at least two agents:
```
docker compose up -d --scale agent-service=2 agent-service
```
Run the self-check, then bring the stack up:
```
python scripts/self_check.py --target docker
docker compose up --build -d
```
- Open http://localhost:3000 and log in.
- Start at least two agents (see command above).
- Run the fire drill:
```
python scripts/fire_drill.py start
```
- Confirm the dashboard shows risk state changes and migration activity.
Trigger a market crash simulation:
```
python scripts/fire_drill.py start
```
Reset back to stable:
```
python scripts/fire_drill.py stop
```
Run the self-check before first deployment or local development:
```
python scripts/self_check.py --target docker
```
For local (non-Docker) development:
```
python scripts/self_check.py --target local
```
Use --allow-port-in-use if services are already running.
Export a support bundle (config redacted, recent telemetry/audits, snapshot manifest, optional snapshots):
```
curl -H "X-API-Key: $COMMAND_API_KEY" \
  "http://localhost:5000/api/v1/diagnostics/bundle?includeSnapshots=true" \
  --output aetherguard-diagnostics.zip
```
Tune snapshot limits with maxSnapshots, maxSnapshotBytes, and maxTotalSnapshotBytes.
Admins can also export a bundle from the dashboard Control Panel.
These scripts validate demo flows and can be reused as product readiness checks:
- verify_blueprint_v1.py
- verify_phase2.py
- verify_phase3.py
If snapshot storage is set to S3/MinIO, run the following to skip the local filesystem assertion:
```
AG_STORAGE_PROVIDER=s3 python verify_phase3.py
```
Default dashboard credentials:
- Username: admin
- Password: admin123
Override via environment variables:
- DASHBOARD_ADMIN_USER
- DASHBOARD_ADMIN_PASSWORD
Core API database connection (docker-compose.yml):
- ConnectionStrings__DefaultConnection=Host=db;Database=AetherGuardDb;Username=postgres;Password=password
Core API artifact base URL (docker-compose.yml):
- ArtifactBaseUrl=https://core-service:8443
Snapshot storage (docker-compose.yml):
- SnapshotStorage__Provider=S3
- SnapshotStorage__S3__Bucket=aether-guard-snapshots
- SnapshotStorage__S3__Region=us-east-1
- SnapshotStorage__S3__Endpoint=http://minio:9000
- SnapshotStorage__S3__AccessKey=minioadmin
- SnapshotStorage__S3__SecretKey=minioadmin
- SnapshotStorage__S3__UsePathStyle=true
- SnapshotStorage__S3__Prefix=snapshots
To keep local filesystem storage, set SnapshotStorage__Provider=Local (or remove the S3 settings).
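As a quick way to inspect the bucket from the host, the boto3 sketch below uses the same S3 settings (MinIO default credentials and path-style addressing). It is not how the Core API talks to S3 (that goes through the .NET SDK), and the assumption that snapshot objects live under the snapshots/ prefix follows from the config above.

```python
import boto3
from botocore.config import Config

# Values mirror the SnapshotStorage__S3__* settings above (MinIO defaults from docker-compose).
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # minio:9000 from inside the compose network
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
    region_name="us-east-1",
    config=Config(s3={"addressing_style": "path"}),  # matches UsePathStyle=true
)

# List snapshot objects under the configured prefix.
resp = s3.list_objects_v2(Bucket="aether-guard-snapshots", Prefix="snapshots/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```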
Snapshot retention (core-service):
- SnapshotRetention__MaxAgeDays=14
- SnapshotRetention__MaxSnapshotsPerWorkload=5
- SnapshotRetention__MaxTotalSnapshots=500
- SnapshotRetention__ApplyS3Lifecycle=true
Dashboard auth (docker-compose.yml):
- AUTH_SECRET=super-secret-key
- AUTH_TRUST_HOST=true
For production, set a strong AUTH_SECRET and use a secret manager.
The default docker-compose stack now provisions SPIRE (server + agent) plus spiffe-helper sidecars to issue and rotate X.509 SVIDs:
- Core serves mTLS on https://core-service:8443 (host-mapped to 5001).
- Agent uses SPIFFE-issued certs from /run/spiffe/certs and calls the mTLS endpoint.
- HTTP on http://core-service:8080 remains for dashboard/AI traffic.

Disable mTLS locally by setting:
- Security__Mtls__Enabled=false (core-service)
- AG_MTLS_ENABLED=false and AG_CORE_URL=http://core-service:8080 (agent-service)
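For reference, a client call against the mTLS listener from inside the compose network might look like the sketch below. The certificate file names under /run/spiffe/certs are hypothetical (check the spiffe-helper configuration for the real paths), and /api/v2/dashboard/latest is just a convenient endpoint to probe.

```python
import requests

# Hypothetical file names; spiffe-helper writes the actual SVID and bundle files.
CERT = ("/run/spiffe/certs/svid.pem", "/run/spiffe/certs/svid_key.pem")
CA_BUNDLE = "/run/spiffe/certs/svid_bundle.pem"

resp = requests.get(
    "https://core-service:8443/api/v2/dashboard/latest",
    cert=CERT,         # client certificate + key (mutual TLS)
    verify=CA_BUNDLE,  # trust the SPIRE-issued bundle instead of system CAs
    timeout=10,
)
print(resp.status_code)
```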
The default stack includes an OpenTelemetry Collector and Jaeger:
- Collector receives OTLP on 4317 (gRPC) and 4318 (HTTP).
- Jaeger UI runs on http://localhost:16686.
Core, AI, and Dashboard emit spans and metrics to the collector when enabled via
environment variables in docker-compose.yml.
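For orientation, typical exporter wiring with the OpenTelemetry Python SDK looks like the sketch below, pointed at the collector's OTLP gRPC port. The actual services read their endpoints from environment variables in docker-compose.yml; the service name and package choices here are assumptions.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point the exporter at the collector's OTLP gRPC port from the port list above.
provider = TracerProvider(resource=Resource.create({"service.name": "ai-engine"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("ai-engine")
with tracer.start_as_current_span("analyze"):
    pass  # spans created here are exported to the collector and visible in Jaeger
```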
Core API (Legacy REST - v1):
- POST /api/v1/ingestion - receive telemetry from agent
- GET /api/v1/dashboard/latest - latest telemetry + AI analysis
- GET /api/v1/dashboard/history - last 20 telemetry records (chronological)
- POST /api/v1/market/signal - update market signal file
- POST /api/v1/artifacts/upload/{workloadId} - upload snapshot
- GET /api/v1/artifacts/download/{workloadId} - download latest snapshot
- GET /api/v1/diagnostics/bundle - export diagnostics bundle (requires X-API-Key)
Core API (gRPC + JSON Transcoding - v2):
- POST /api/v2/agent/register
- POST /api/v2/agent/heartbeat
- GET /api/v2/agent/poll
- POST /api/v2/agent/feedback
- POST /api/v2/ingestion
- POST /api/v2/commands/queue
- GET /api/v2/dashboard/latest
- GET /api/v2/dashboard/history
AI Engine:
- POST /analyze - classify telemetry with spotPriceHistory, rebalanceSignal, capacityScore
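A hedged example of calling the analyze endpoint directly, using the field names mentioned in this README; the exact request model (casing, required fields, response shape) should be checked against the AI engine code.

```python
import requests

AI_URL = "http://localhost:8000"

# Field names follow this README; values are illustrative.
payload = {
    "spotPriceHistory": [0.085, 0.087, 0.092, 0.110],
    "rebalanceSignal": False,
    "capacityScore": 0.8,
}

resp = requests.post(f"{AI_URL}/analyze", json=payload, timeout=10)
resp.raise_for_status()
print(resp.json())  # e.g. risk status, confidence, root cause
```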
The demo uses file-based signals that are mounted into containers via docker-compose:
- Core signal: src/services/core-dotnet/AetherGuard.Core/Data/market_signal.json
- AI prices: src/services/ai-engine/Data/spot_prices.json
The fire drill script writes these files and creates the directories if missing.
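To show the mechanism (not the script's real schema), the sketch below writes crash-scenario values into those two files. The JSON keys are hypothetical, so prefer scripts/fire_drill.py for anything beyond illustration.

```python
import json
from pathlib import Path

# Paths come from this README; the JSON keys below are assumptions -- fire_drill.py owns the real schema.
CORE_SIGNAL = Path("src/services/core-dotnet/AetherGuard.Core/Data/market_signal.json")
AI_PRICES = Path("src/services/ai-engine/Data/spot_prices.json")

def write_crash_scenario() -> None:
    CORE_SIGNAL.parent.mkdir(parents=True, exist_ok=True)
    AI_PRICES.parent.mkdir(parents=True, exist_ok=True)
    CORE_SIGNAL.write_text(json.dumps({"rebalanceSignal": True}))          # hypothetical key
    AI_PRICES.write_text(json.dumps({"prices": [0.08, 0.09, 0.15, 0.30]}))  # hypothetical key

write_crash_scenario()
```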
Risk scoring uses these rules:
- rebalanceSignal=true: CRITICAL (Cloud Provider Signal)
- Trend > 0.2: CRITICAL (Price Spike Detected)
- Volatility > 5.0: CRITICAL (Market Instability)
- Otherwise: LOW (Stable)
Note: The Core API currently sends an empty spotPriceHistory list; wire that data into Analyze requests to drive volatility decisions.
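The rules above translate roughly into the following sketch. The trend and volatility formulas here (relative change over the window, population standard deviation) are assumptions for illustration, not the AI engine's exact math.

```python
import statistics

def classify(spot_price_history: list[float], rebalance_signal: bool) -> tuple[str, str]:
    """Apply the README's rule order: provider signal, then trend, then volatility, else stable."""
    if rebalance_signal:
        return "CRITICAL", "Cloud Provider Signal"
    if len(spot_price_history) >= 2 and spot_price_history[0]:
        trend = (spot_price_history[-1] - spot_price_history[0]) / spot_price_history[0]
        volatility = statistics.pstdev(spot_price_history)
        if trend > 0.2:
            return "CRITICAL", "Price Spike Detected"
        if volatility > 5.0:
            return "CRITICAL", "Market Instability"
    return "LOW", "Stable"

print(classify([100.0, 101.0, 130.0], rebalance_signal=False))  # price spike -> CRITICAL
```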
TelemetryRecord persisted to PostgreSQL:
- AgentId
- WorkloadTier
- RebalanceSignal
- DiskAvailable
- CpuUsage (defaults to 0 in the current pipeline)
- MemoryUsage (defaults to 0 in the current pipeline)
- AiStatus
- AiConfidence
- RootCause
- PredictedCpu
- Timestamp (UTC)
For production, add EF Core migrations and a formal upgrade process.
Dashboard:
```
cd src/web/dashboard
npm install
npm run dev
```
Core API:
```
cd src/services/core-dotnet/AetherGuard.Core
dotnet restore
dotnet run
```
AI Engine:
```
cd src/services/ai-engine
python -m venv .venv
.venv/Scripts/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000
```
C++ Agent:
```
cd src/services/agent-cpp
cmake -S . -B build
cmake --build build
./build/AetherAgent
```
Windows build note: install Visual Studio Build Tools, CMake, Ninja, and NASM, and run the build from a VS Developer Command Prompt.
Note: If CRIU is unavailable (Windows/Docker Desktop), the agent runs in simulation mode and still produces a valid snapshot archive for demo flows.
- Authentication uses NextAuth Credentials for the MVP; use an external identity provider for production.
- CORS is limited to http://localhost:3000 in development.
- Secrets and credentials must be rotated for any public deployment.
The supply-chain workflow builds container images, generates SBOMs, signs with cosign, and
emits SLSA provenance for each image. Artifacts are uploaded to GitHub Actions and images are
pushed to GHCR under ghcr.io/<owner>/aether-guard/....
Please read CONTRIBUTING.md for setup, workflow, and PR guidelines.
MIT License. See LICENSE.
