Conversation

@afarntrog
Contributor

This pull request introduces the initial infrastructure and core evaluation logic for the Strands Evals project. It provides AWS CDK stacks for deploying the dashboard and evaluation pipeline, implements Lambda functions for dashboard authentication and evaluation execution, and establishes a modular system for configuring and registering new evaluators. The changes include extensive documentation and a well-structured approach to adding new evaluation types.

Evaluation Flow

Agent run completes
→ SQS message (session_id, eval_type)
→ Lambda trigger
→ Fetch session from Langfuse
→ Run evaluators (based on eval_type config)
→ Export results to S3
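
The Lambda side of this flow can be sketched roughly as follows. This is not the exact handler in the PR: the import path, bucket name, and S3 key layout are assumptions for illustration; run_session_evaluation is the function name used by the eval runner in this PR.

```python
import json

import boto3

# The import path is an assumption; run_session_evaluation is the eval runner's
# entry point per this PR's commits.
from eval_runner import run_session_evaluation

s3 = boto3.client("s3")
RESULTS_BUCKET = "strands-evals-results"  # placeholder bucket name


def lambda_handler(event, context):
    """SQS-triggered entry point: one evaluation per queued message."""
    for record in event["Records"]:
        body = json.loads(record["body"])
        session_id = body["session_id"]
        eval_type = body["eval_type"]

        # Fetch the session from Langfuse and run the evaluators configured
        # for this eval_type.
        results = run_session_evaluation(session_id, eval_type)

        # Export results to S3; keyed per message here, so every run of a
        # multi-turn session produces its own object (see discussion below).
        s3.put_object(
            Bucket=RESULTS_BUCKET,
            Key=f"{eval_type}/{session_id}/{record['messageId']}.json",
            Body=json.dumps(results),
        )

    return {"statusCode": 200}
```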

Infrastructure and Deployment (CDK & Lambda):

  • Added a comprehensive README.md detailing the architecture, setup, deployment, and usage of the Strands Evals infrastructure, including stack outputs and instructions for updating credentials and adding new evaluators.
  • Introduced cdk.json to configure the AWS CDK app, including watch settings and context for best practices and resource policies.
  • Implemented a Lambda@Edge function (lambda/basic-auth/index.js) that enforces Basic Auth for the dashboard, with credentials injected at deploy time from AWS Secrets Manager (a sketch of the check follows this list).
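
For illustration, the core of that viewer-request check looks roughly like the following. It is written in Python for consistency with the other sketches here, although the deployed function is Node.js (lambda/basic-auth/index.js), and the hard-coded credentials are placeholders for the values injected from Secrets Manager at deploy time.

```python
import base64

# Placeholder credentials; the real values are injected from Secrets Manager
# into the function bundle at deploy time.
EXPECTED = "Basic " + base64.b64encode(b"admin:changeme").decode()


def handler(event, context):
    """CloudFront viewer-request handler enforcing HTTP Basic Auth."""
    request = event["Records"][0]["cf"]["request"]
    headers = request.get("headers", {})
    auth = headers.get("authorization", [{}])[0].get("value", "")

    if auth != EXPECTED:
        # Reject with a 401 and a Basic Auth challenge.
        return {
            "status": "401",
            "statusDescription": "Unauthorized",
            "headers": {
                "www-authenticate": [{"key": "WWW-Authenticate", "value": "Basic"}]
            },
        }

    # Credentials match: forward the request to the origin (the dashboard bucket).
    return request
```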

Evaluation Pipeline Core Logic:

  • Added eval_configs.py to map evaluation types (e.g., github_issue, release_notes) to their evaluator configurations, with a registry for easy extension and error handling for unknown types (see the sketch after this list).
  • Created an evaluators module with an __init__.py that registers custom evaluators for agent evaluation, enabling modular addition and import.
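
A rough sketch of that configuration/registry pattern, assuming a minimal EvalConfig shape. The eval type names and evaluator classes come from this PR, but the exact dataclass fields, module paths, and which evaluators run for which type are assumptions.

```python
from dataclasses import dataclass, field

# Module paths assumed from the file names in this PR.
from evaluators.concise_response import ConciseResponseEvaluator
from evaluators.expected_trajectory import ExpectedTrajectoryEvaluator


@dataclass
class EvalConfig:
    """Maps an eval type to the evaluator classes that should run for it."""
    eval_type: str
    evaluators: list = field(default_factory=list)


# Registry of known eval types; adding a new type is one entry here
# ("reviewer" and future types register the same way).
EVAL_CONFIGS = {
    "github_issue": EvalConfig("github_issue", [ConciseResponseEvaluator, ExpectedTrajectoryEvaluator]),
    "release_notes": EvalConfig("release_notes", [ConciseResponseEvaluator]),
}


def get_eval_config(eval_type: str) -> EvalConfig:
    """Look up the config for an eval type, failing loudly on unknown types."""
    try:
        return EVAL_CONFIGS[eval_type]
    except KeyError:
        raise ValueError(
            f"Unknown eval_type: {eval_type!r}. Known types: {sorted(EVAL_CONFIGS)}"
        )
```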

Custom Evaluators Implementation:

  • Implemented ConciseResponseEvaluator (concise_response.py), which uses an LLM judge to assess if agent responses are appropriately concise, providing structured output and normalization of scores.
  • Added ExpectedTrajectoryEvaluator (expected_trajectory.py), a pure-Python evaluator that compares actual tool usage against the expected trajectory using recall, precision, and F1 metrics, with detailed reasoning in the output (the metric computation is sketched after this list).
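
The metric part of the trajectory comparison is plain set arithmetic. A minimal sketch, assuming expected and actual tool usage are compared as sets of tool names (the real evaluator also produces detailed reasoning and may be order-aware):

```python
def trajectory_scores(expected_tools: list[str], actual_tools: list[str]) -> dict:
    """Score actual tool usage against the expected trajectory."""
    expected, actual = set(expected_tools), set(actual_tools)
    matched = expected & actual

    # Recall: how much of the expected trajectory was actually executed.
    recall = len(matched) / len(expected) if expected else 1.0
    # Precision: how much of the actual tool usage was expected.
    precision = len(matched) / len(actual) if actual else 1.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {"recall": recall, "precision": precision, "f1": f1, "matched": sorted(matched)}
```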

…ipeline

Add complete CDK infrastructure for Strands Evals system with two stacks:

- DashboardStack: S3 bucket, CloudFront distribution with Lambda@Edge
  basic authentication, and Secrets Manager for credentials
- EvalPipelineStack: SQS queue for triggering evaluations, Lambda
  function for running post-hoc evals, and Langfuse integration

Includes comprehensive documentation, deployment scripts, and
TypeScript configuration for infrastructure as code management.
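
The stacks in this PR are TypeScript, but for consistency with the other sketches here, the EvalPipelineStack wiring (SQS queue feeding the eval Lambda) looks roughly like this in the Python CDK bindings; resource names, runtime, asset paths, and timeouts are placeholders.

```python
from aws_cdk import Duration, Stack
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_sqs as sqs
from aws_cdk.aws_lambda_event_sources import SqsEventSource
from constructs import Construct


class EvalPipelineStack(Stack):
    """SQS queue that triggers the post-hoc evaluation Lambda."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Queue that agent runs publish (session_id, eval_type) messages to.
        queue = sqs.Queue(self, "EvalQueue", visibility_timeout=Duration.minutes(15))

        # Lambda that fetches the session from Langfuse and runs the evaluators.
        eval_fn = _lambda.Function(
            self,
            "EvalRunner",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="handler.lambda_handler",
            code=_lambda.Code.from_asset("lambda/eval-runner"),  # placeholder path
            timeout=Duration.minutes(5),
        )

        # Wire the queue to the function, one message per invocation.
        eval_fn.add_event_source(SqsEventSource(queue, batch_size=1))
```
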
- Create eval_configs.py with EvalConfig dataclass to map eval types
  to their evaluator configurations
- Add configuration entries for github_issue, release_notes, and
  reviewer evaluation types
- Expand .gitignore with patterns for node_modules, CDK output,
  TypeScript build files, IDE files, and Python venv
- Simplify eval runner by removing verbose logging and redundant
  calculations
- Rename run_direct_session_evaluation to run_session_evaluation
- Remove PostHocEvaluator class dependency and import
- Call mapper.get_session() directly instead of through evaluator
- Remove verbose input/output preview debug logging
- Streamline session evaluation initialization code
Remove unnecessary `create_post_hoc_task` factory function and
prefetched sessions dictionary. Replace with an inline task function
that directly captures the session and agent_output variables already
in scope, reducing complexity for the single-session evaluation case.
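
To make that change concrete, here is a rough, self-contained sketch of the resulting shape. The mapper and evaluator calls are stand-in stubs; only LangfuseSessionMapper.get_session and run_session_evaluation are names taken from this PR, and the agent_output attribute is an assumption.

```python
class LangfuseSessionMapper:
    """Stand-in stub for the real mapper; get_session is the name used in the PR."""
    def get_session(self, session_id: str):
        ...


def run_evaluators(session, agent_output, eval_type: str):
    """Hypothetical helper standing in for the actual evaluator invocation."""
    ...


def run_session_evaluation(session_id: str, eval_type: str):
    """Single-session case: no factory function, no prefetched sessions dict."""
    mapper = LangfuseSessionMapper()
    session = mapper.get_session(session_id)
    agent_output = getattr(session, "agent_output", None)  # attribute name assumed

    # Inline task that closes over session/agent_output already in scope,
    # replacing the removed create_post_hoc_task factory.
    def task():
        return run_evaluators(session, agent_output, eval_type)

    return task()
```
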
@afarntrog requested a review from Unshure on January 28, 2026 at 19:15
@lizradway
Member

might be nice to break apart eval scripts from cdk infra code for ease of review/legibility

"""Run evaluation on a single session by ID."""
logger.info(f"Running session evaluation: session_id={session_id}, eval_type={eval_type}")

mapper = LangfuseSessionMapper()
Contributor Author

Would consider refactoring this a bit later on when we introduce other Mapper types, so we can handle the case where one agent is sending traces to Langfuse and another to CloudWatch, etc. Just not over-optimizing now.

@afarntrog
Contributor Author

Also, this will trigger per run (the script sends the message to SQS after the agent is done), so for a multi-turn conversation we will get one evaluation run per agent run.

There are a few ways I'm thinking of to get around this:

  • We leave it as is. Maybe it's even helpful to see how the evaluations progress after each complete interaction.
  • Use the session id as the object key in S3 so that each run overwrites the previous one. Evaluations will still run each time, but we will only keep one evaluation per agent (see the sketch after this list).
  • We provide the agent with a tool to send the message to SQS once it's complete, so if the agent expects follow-ups it ideally won't put the message onto the queue. The obvious downside is that this is non-deterministic: the agent may call the tool too many times or not at all.
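
A sketch of the second option: if the exporter keys results by session id alone (no run id or timestamp in the key), each later run simply overwrites the earlier object. Bucket and key layout here are placeholders.

```python
import json

import boto3

s3 = boto3.client("s3")


def export_results(bucket: str, eval_type: str, session_id: str, results: dict) -> None:
    """Write results keyed by session_id so re-runs for the same session overwrite."""
    s3.put_object(
        Bucket=bucket,
        Key=f"{eval_type}/{session_id}.json",  # no run id / timestamp in the key
        Body=json.dumps(results).encode("utf-8"),
    )
```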
