Conversation

@afarntrog
Contributor

This pull request introduces the initial infrastructure and core evaluation logic for the Strands Evals project. It provides AWS CDK stacks for deploying the dashboard and evaluation pipeline, implements Lambda functions for dashboard authentication and evaluation execution, and establishes a modular system for configuring and registering new evaluators. The changes include extensive documentation and a well-structured approach to adding new evaluation types.

Evaluation Flow

Agent run completes
→ SQS message (session_id, eval_type)
→ Lambda trigger
→ Fetch session from Langfuse
→ Run evaluators (based on eval_type config)
→ Export results to S3
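
The Lambda side of this flow can be sketched roughly as follows. This is not the exact handler in the PR: the import path, bucket name, and S3 key layout are assumptions for illustration; run_session_evaluation is the function name used by the eval runner in this PR.

```python
import json

import boto3

# The import path is an assumption; run_session_evaluation is the eval runner's
# entry point per this PR's commits.
from eval_runner import run_session_evaluation

s3 = boto3.client("s3")
RESULTS_BUCKET = "strands-evals-results"  # placeholder bucket name


def lambda_handler(event, context):
    """SQS-triggered entry point: one evaluation per queued message."""
    for record in event["Records"]:
        body = json.loads(record["body"])
        session_id = body["session_id"]
        eval_type = body["eval_type"]

        # Fetch the session from Langfuse and run the evaluators configured
        # for this eval_type.
        results = run_session_evaluation(session_id, eval_type)

        # Export results to S3; keyed per message here, so every run of a
        # multi-turn session produces its own object (see discussion below).
        s3.put_object(
            Bucket=RESULTS_BUCKET,
            Key=f"{eval_type}/{session_id}/{record['messageId']}.json",
            Body=json.dumps(results),
        )

    return {"statusCode": 200}
```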

Infrastructure and Deployment (CDK & Lambda):

  • Added a comprehensive README.md detailing the architecture, setup, deployment, and usage of the Strands Evals infrastructure, including stack outputs and instructions for updating credentials and adding new evaluators.
  • Introduced cdk.json to configure the AWS CDK app, including watch settings and context for best practices and resource policies.
  • Implemented a Lambda@Edge function (lambda/basic-auth/index.js) that enforces Basic Auth for the dashboard, with credentials injected at deploy time from AWS Secrets Manager (a sketch of the check follows this list).
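
For illustration, the core of that viewer-request check looks roughly like the following. It is written in Python for consistency with the other sketches here, although the deployed function is Node.js (lambda/basic-auth/index.js), and the hard-coded credentials are placeholders for the values injected from Secrets Manager at deploy time.

```python
import base64

# Placeholder credentials; the real values are injected from Secrets Manager
# into the function bundle at deploy time.
EXPECTED = "Basic " + base64.b64encode(b"admin:changeme").decode()


def handler(event, context):
    """CloudFront viewer-request handler enforcing HTTP Basic Auth."""
    request = event["Records"][0]["cf"]["request"]
    headers = request.get("headers", {})
    auth = headers.get("authorization", [{}])[0].get("value", "")

    if auth != EXPECTED:
        # Reject with a 401 and a Basic Auth challenge.
        return {
            "status": "401",
            "statusDescription": "Unauthorized",
            "headers": {
                "www-authenticate": [{"key": "WWW-Authenticate", "value": "Basic"}]
            },
        }

    # Credentials match: forward the request to the origin (the dashboard bucket).
    return request
```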

Evaluation Pipeline Core Logic:

  • Added eval_configs.py to map evaluation types (e.g., github_issue, release_notes) to their evaluator configurations, with a registry for easy extension and error handling for unknown types (see the sketch after this list).
  • Created an evaluators module with an __init__.py that registers custom evaluators for agent evaluation, enabling modular addition and import.
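
A rough sketch of that configuration/registry pattern, assuming a minimal EvalConfig shape. The eval type names and evaluator classes come from this PR, but the exact dataclass fields, module paths, and which evaluators run for which type are assumptions.

```python
from dataclasses import dataclass, field

# Module paths assumed from the file names in this PR.
from evaluators.concise_response import ConciseResponseEvaluator
from evaluators.expected_trajectory import ExpectedTrajectoryEvaluator


@dataclass
class EvalConfig:
    """Maps an eval type to the evaluator classes that should run for it."""
    eval_type: str
    evaluators: list = field(default_factory=list)


# Registry of known eval types; adding a new type is one entry here
# ("reviewer" and future types register the same way).
EVAL_CONFIGS = {
    "github_issue": EvalConfig("github_issue", [ConciseResponseEvaluator, ExpectedTrajectoryEvaluator]),
    "release_notes": EvalConfig("release_notes", [ConciseResponseEvaluator]),
}


def get_eval_config(eval_type: str) -> EvalConfig:
    """Look up the config for an eval type, failing loudly on unknown types."""
    try:
        return EVAL_CONFIGS[eval_type]
    except KeyError:
        raise ValueError(
            f"Unknown eval_type: {eval_type!r}. Known types: {sorted(EVAL_CONFIGS)}"
        )
```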

Custom Evaluators Implementation:

  • Implemented ConciseResponseEvaluator (concise_response.py), which uses an LLM judge to assess if agent responses are appropriately concise, providing structured output and normalization of scores.
  • Added ExpectedTrajectoryEvaluator (expected_trajectory.py), a pure-Python evaluator that compares actual tool usage against the expected trajectory using recall, precision, and F1 metrics, with detailed reasoning in the output (the metric computation is sketched after this list).
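
The metric part of the trajectory comparison is plain set arithmetic. A minimal sketch, assuming expected and actual tool usage are compared as sets of tool names (the real evaluator also produces detailed reasoning and may be order-aware):

```python
def trajectory_scores(expected_tools: list[str], actual_tools: list[str]) -> dict:
    """Score actual tool usage against the expected trajectory."""
    expected, actual = set(expected_tools), set(actual_tools)
    matched = expected & actual

    # Recall: how much of the expected trajectory was actually executed.
    recall = len(matched) / len(expected) if expected else 1.0
    # Precision: how much of the actual tool usage was expected.
    precision = len(matched) / len(actual) if actual else 1.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {"recall": recall, "precision": precision, "f1": f1, "matched": sorted(matched)}
```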

…ipeline

Add complete CDK infrastructure for Strands Evals system with two stacks:

- DashboardStack: S3 bucket, CloudFront distribution with Lambda@Edge
  basic authentication, and Secrets Manager for credentials
- EvalPipelineStack: SQS queue for triggering evaluations, Lambda
  function for running post-hoc evals, and Langfuse integration

Includes comprehensive documentation, deployment scripts, and
TypeScript configuration for infrastructure as code management.
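
The stacks in this PR are TypeScript, but for consistency with the other sketches here, the EvalPipelineStack wiring (SQS queue feeding the eval Lambda) looks roughly like this in the Python CDK bindings; resource names, runtime, asset paths, and timeouts are placeholders.

```python
from aws_cdk import Duration, Stack
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_sqs as sqs
from aws_cdk.aws_lambda_event_sources import SqsEventSource
from constructs import Construct


class EvalPipelineStack(Stack):
    """SQS queue that triggers the post-hoc evaluation Lambda."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Queue that agent runs publish (session_id, eval_type) messages to.
        queue = sqs.Queue(self, "EvalQueue", visibility_timeout=Duration.minutes(15))

        # Lambda that fetches the session from Langfuse and runs the evaluators.
        eval_fn = _lambda.Function(
            self,
            "EvalRunner",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="handler.lambda_handler",
            code=_lambda.Code.from_asset("lambda/eval-runner"),  # placeholder path
            timeout=Duration.minutes(5),
        )

        # Wire the queue to the function, one message per invocation.
        eval_fn.add_event_source(SqsEventSource(queue, batch_size=1))
```
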
- Create eval_configs.py with EvalConfig dataclass to map eval types
  to their evaluator configurations
- Add configuration entries for github_issue, release_notes, and
  reviewer evaluation types
- Expand .gitignore with patterns for node_modules, CDK output,
  TypeScript build files, IDE files, and Python venv
- Simplify eval runner by removing verbose logging and redundant
  calculations
- Rename run_direct_session_evaluation to run_session_evaluation
- Remove PostHocEvaluator class dependency and import
- Call mapper.get_session() directly instead of through evaluator
- Remove verbose input/output preview debug logging
- Streamline session evaluation initialization code
Remove unnecessary `create_post_hoc_task` factory function and
prefetched sessions dictionary. Replace with an inline task function
that directly captures the session and agent_output variables already
in scope, reducing complexity for the single-session evaluation case.
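
To make that change concrete, here is a rough, self-contained sketch of the resulting shape. The mapper and evaluator calls are stand-in stubs; only LangfuseSessionMapper.get_session and run_session_evaluation are names taken from this PR, and the agent_output attribute is an assumption.

```python
class LangfuseSessionMapper:
    """Stand-in stub for the real mapper; get_session is the name used in the PR."""
    def get_session(self, session_id: str):
        ...


def run_evaluators(session, agent_output, eval_type: str):
    """Hypothetical helper standing in for the actual evaluator invocation."""
    ...


def run_session_evaluation(session_id: str, eval_type: str):
    """Single-session case: no factory function, no prefetched sessions dict."""
    mapper = LangfuseSessionMapper()
    session = mapper.get_session(session_id)
    agent_output = getattr(session, "agent_output", None)  # attribute name assumed

    # Inline task that closes over session/agent_output already in scope,
    # replacing the removed create_post_hoc_task factory.
    def task():
        return run_evaluators(session, agent_output, eval_type)

    return task()
```
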
@afarntrog requested a review from Unshure on January 28, 2026 at 19:15
@lizradway
Member

might be nice to break apart eval scripts from cdk infra code for ease of review/legibility

"""Run evaluation on a single session by ID."""
logger.info(f"Running session evaluation: session_id={session_id}, eval_type={eval_type}")

mapper = LangfuseSessionMapper()
Contributor Author

Would consider refactoring this a bit later on when we introduce other Mapper types, so we can handle the case where one agent is sending traces to Langfuse and another to CloudWatch, etc. Just not over-optimizing now.

@afarntrog
Contributor Author

Also, this will trigger per run (the script sends the message to SQS after the agent is done), so for a multi-turn conversation we will get one evaluation run per agent run.

There are a few ways I'm thinking of to get around this:

  • We leave it as is. Maybe it's even helpful to see how the evaluations progress after each complete interaction.
  • Use the session id as the object key in S3 so that each run overwrites the previous one. Evaluations will still run each time, but we will only keep one evaluation per agent (see the sketch after this list).
  • We provide the agent with a tool to send the message to SQS once it's complete, so if the agent expects follow-ups it ideally won't put the message onto the queue. The obvious downside is that this is non-deterministic: the agent may call the tool too many times or not at all.
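
A sketch of the second option: if the exporter keys results by session id alone (no run id or timestamp in the key), each later run simply overwrites the earlier object. Bucket and key layout here are placeholders.

```python
import json

import boto3

s3 = boto3.client("s3")


def export_results(bucket: str, eval_type: str, session_id: str, results: dict) -> None:
    """Write results keyed by session_id so re-runs for the same session overwrite."""
    s3.put_object(
        Bucket=bucket,
        Key=f"{eval_type}/{session_id}.json",  # no run id / timestamp in the key
        Body=json.dumps(results).encode("utf-8"),
    )
```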
