Evals From Traces Deployed on AWS #10
base: main
Conversation
…ipeline

Add complete CDK infrastructure for the Strands Evals system with two stacks:

- DashboardStack: S3 bucket, CloudFront distribution with Lambda@Edge basic authentication, and Secrets Manager for credentials
- EvalPipelineStack: SQS queue for triggering evaluations, Lambda function for running post-hoc evals, and Langfuse integration (see the sketch below)

Includes comprehensive documentation, deployment scripts, and TypeScript configuration for infrastructure-as-code management.
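The stacks themselves are defined in TypeScript in this PR; purely to illustrate the EvalPipelineStack wiring described above, here is a minimal sketch using the Python CDK bindings. The construct IDs, timeouts, and the asset path are assumptions, not the PR's actual values.

```python
# Minimal sketch of the eval pipeline wiring; the PR implements this in TypeScript.
# Construct IDs, timeouts, and the "lambda/eval_runner" asset path are illustrative assumptions.
from aws_cdk import Stack, Duration, aws_sqs as sqs, aws_lambda as _lambda
from aws_cdk.aws_lambda_event_sources import SqsEventSource
from constructs import Construct


class EvalPipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Queue that receives "evaluate this session" messages after an agent run finishes.
        eval_queue = sqs.Queue(
            self, "EvalTriggerQueue",
            visibility_timeout=Duration.minutes(15),  # must exceed the Lambda timeout
        )

        # Lambda that pulls the session from Langfuse and runs the post-hoc evaluators.
        eval_runner = _lambda.Function(
            self, "EvalRunner",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="handler.main",
            code=_lambda.Code.from_asset("lambda/eval_runner"),
            timeout=Duration.minutes(10),
        )

        # Each SQS message triggers one evaluation run.
        eval_runner.add_event_source(SqsEventSource(eval_queue))
```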
- Create `eval_configs.py` with an `EvalConfig` dataclass to map eval types to their evaluator configurations (sketched below)
- Add configuration entries for `github_issue`, `release_notes`, and `reviewer` evaluation types
- Expand `.gitignore` with patterns for node_modules, CDK output, TypeScript build files, IDE files, and Python venv
- Simplify eval runner by removing verbose logging and redundant calculations
- Rename `run_direct_session_evaluation` to `run_session_evaluation`
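A rough sketch of what the `eval_configs.py` registry described above could look like. The field names, evaluator assignments, and registry shape here are assumptions, not the PR's actual code.

```python
# Hypothetical sketch of eval_configs.py; field names, evaluator assignments,
# and the registry shape are assumptions for illustration.
from dataclasses import dataclass, field


@dataclass
class EvalConfig:
    """Maps an eval type to the evaluators that should run for it."""
    eval_type: str
    evaluator_names: list[str] = field(default_factory=list)


# Registry of known eval types; adding a new type is a one-line change.
EVAL_CONFIGS: dict[str, EvalConfig] = {
    "github_issue": EvalConfig("github_issue", ["concise_response", "expected_trajectory"]),
    "release_notes": EvalConfig("release_notes", ["concise_response"]),
    "reviewer": EvalConfig("reviewer", ["expected_trajectory"]),
}


def get_eval_config(eval_type: str) -> EvalConfig:
    """Look up a config, failing loudly on unknown eval types."""
    try:
        return EVAL_CONFIGS[eval_type]
    except KeyError:
        raise ValueError(
            f"Unknown eval type '{eval_type}'. Known types: {sorted(EVAL_CONFIGS)}"
        ) from None
```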
- Remove PostHocEvaluator class dependency and import
- Call `mapper.get_session()` directly instead of through the evaluator
- Remove verbose input/output preview debug logging
- Streamline session evaluation initialization code
Remove unnecessary `create_post_hoc_task` factory function and prefetched sessions dictionary. Replace with an inline task function that directly captures the session and agent_output variables already in scope, reducing complexity for the single-session evaluation case.
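A sketch of the shape of this refactor; `LangfuseSessionMapper` and `evaluate()` are stubbed out here so the example is self-contained and do not reflect the project's real signatures.

```python
# Illustrative sketch of the refactor only; the stubs below stand in for the
# project's real mapper and evaluator and are not the PR's actual code.
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    agent_output: str


class LangfuseSessionMapper:
    """Stub standing in for the real mapper, which fetches the session from Langfuse."""

    def get_session(self, session_id: str) -> Session:
        return Session(session_id=session_id, agent_output="example agent output")


def evaluate(session: Session, agent_output: str, eval_type: str) -> dict:
    """Stub standing in for the real evaluator invocation."""
    return {"session_id": session.session_id, "eval_type": eval_type, "score": 1.0}


def run_session_evaluation(session_id: str, eval_type: str) -> dict:
    """Single-session path: no factory function, no prefetched sessions dict."""
    mapper = LangfuseSessionMapper()
    session = mapper.get_session(session_id)
    agent_output = session.agent_output

    def task() -> dict:
        # The inline task closes over `session` and `agent_output` already in scope,
        # replacing the former create_post_hoc_task(...) factory.
        return evaluate(session, agent_output, eval_type=eval_type)

    return task()


print(run_session_evaluation("abc123", "github_issue"))
```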
Might be nice to break apart the eval scripts from the CDK infra code for ease of review/legibility.
| """Run evaluation on a single session by ID.""" | ||
| logger.info(f"Running session evaluation: session_id={session_id}, eval_type={eval_type}") | ||
|
|
||
| mapper = LangfuseSessionMapper() |
Would consider refactoring this a bit later on when we introduce other Mapper types. We can handle a case where one agent is sending traces to Langfuse and another to CloudWatch, etc. Just not over-optimizing for now.
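One possible shape for that later refactor, sketched with a hypothetical `SessionMapper` protocol; none of these names (other than `LangfuseSessionMapper`) appear in the PR.

```python
# Hypothetical sketch of a mapper abstraction for a later refactor; not in the PR.
from typing import Protocol


class SessionMapper(Protocol):
    """Anything that can fetch a recorded agent session by ID."""

    def get_session(self, session_id: str): ...


class LangfuseSessionMapper:
    def get_session(self, session_id: str):
        ...  # fetch the session's traces from Langfuse


class CloudWatchSessionMapper:
    def get_session(self, session_id: str):
        ...  # reconstruct the session from CloudWatch logs/traces


def get_mapper(trace_backend: str) -> SessionMapper:
    """Pick a mapper per agent, so one agent can trace to Langfuse and another to CloudWatch."""
    mappers = {"langfuse": LangfuseSessionMapper, "cloudwatch": CloudWatchSessionMapper}
    return mappers[trace_backend]()
```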
Also, this will trigger per run (the script sends the message to SQS after the agent is done), so for a multi-turn conversation we will have an evaluation run per run. There are some ways I'm thinking of to get around this:
This pull request introduces the initial infrastructure and core evaluation logic for the Strands Evals project. It provides AWS CDK stacks for deploying the dashboard and evaluation pipeline, implements Lambda functions for dashboard authentication and evaluation execution, and establishes a modular system for configuring and registering new evaluators. The changes include extensive documentation and a well-structured approach to adding new evaluation types.
Evaluation Flow
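In broad strokes, pieced together from the stack description and review comments: the agent-side script sends a message to the SQS queue when a run finishes; the pipeline Lambda consumes it, pulls the session from Langfuse via the mapper, and runs the evaluators configured for that eval type. A minimal sketch of the trigger side follows; the queue URL environment variable and the message fields are assumptions, not the PR's exact contract.

```python
# Sketch of the trigger side of the flow; the env var name and message fields
# are assumptions for illustration, not the PR's actual message contract.
import json
import os

import boto3


def trigger_evaluation(session_id: str, eval_type: str) -> None:
    """Called by the agent-side script after a run finishes."""
    sqs = boto3.client("sqs")
    sqs.send_message(
        QueueUrl=os.environ["EVAL_QUEUE_URL"],
        MessageBody=json.dumps({"session_id": session_id, "eval_type": eval_type}),
    )
    # The EvalPipelineStack Lambda consumes this message, fetches the session
    # from Langfuse via the mapper, and runs the configured evaluators.
```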
Infrastructure and Deployment (CDK & Lambda):
- `README.md` detailing the architecture, setup, deployment, and usage of the Strands Evals infrastructure, including stack outputs and instructions for updating credentials and adding new evaluators.
- `cdk.json` to configure the AWS CDK app, including watch settings and context for best practices and resource policies.
- Lambda@Edge function (`lambda/basic-auth/index.js`) that enforces Basic Auth for the dashboard, with credentials injected at deploy time from AWS Secrets Manager (sketched below for illustration).
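The PR implements this check in Node.js (`lambda/basic-auth/index.js`), with the real credentials substituted in at deploy time from Secrets Manager. Purely to illustrate the viewer-request check, here is the same idea sketched in Python; the hard-coded placeholder credential is not how the PR handles secrets.

```python
# Python rendering of the Basic Auth check for illustration only; the PR's actual
# function is Node.js, with credentials injected at deploy time from Secrets Manager.
import base64

# Placeholder: in the PR, the real value comes from Secrets Manager at deploy time.
EXPECTED = "Basic " + base64.b64encode(b"user:password").decode()


def handler(event, context):
    # Lambda@Edge viewer-request event: headers are lowercase keys mapping to
    # lists of {"key": ..., "value": ...} entries.
    request = event["Records"][0]["cf"]["request"]
    auth_values = request["headers"].get("authorization", [])
    if auth_values and auth_values[0].get("value") == EXPECTED:
        return request  # let CloudFront continue to the origin

    # Missing or wrong credentials: ask the browser to prompt for Basic Auth.
    return {
        "status": "401",
        "statusDescription": "Unauthorized",
        "headers": {
            "www-authenticate": [{"key": "WWW-Authenticate", "value": "Basic"}],
        },
        "body": "Unauthorized",
    }
```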
Evaluation Pipeline Core Logic:
- `eval_configs.py` to map evaluation types (e.g., `github_issue`, `release_notes`) to their evaluator configurations, with a registry for easy extension and error handling for unknown types.
- `evaluators` module with an `__init__.py` that registers custom evaluators for agent evaluation, enabling modular addition and import.
Custom Evaluators Implementation:
- `ConciseResponseEvaluator` (`concise_response.py`), which uses an LLM judge to assess whether agent responses are appropriately concise, providing structured output and normalization of scores.
- `ExpectedTrajectoryEvaluator` (`expected_trajectory.py`), a pure Python evaluator that compares actual tool usage against the expected trajectory using recall, precision, and F1 metrics, with detailed reasoning in the output (see the sketch below).
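A sketch of the recall/precision/F1 comparison that `ExpectedTrajectoryEvaluator` is described as performing; treating trajectories as flat lists of tool names is an assumption, and the real evaluator may compare richer structures.

```python
# Illustrative recall/precision/F1 over tool usage; this only mirrors the metrics
# named above, and the real evaluator's inputs and scoring details may differ.
from collections import Counter


def trajectory_scores(expected_tools: list[str], actual_tools: list[str]) -> dict[str, float]:
    expected, actual = Counter(expected_tools), Counter(actual_tools)
    # Count tool calls that appear in both trajectories, respecting multiplicity.
    overlap = sum((expected & actual).values())

    precision = overlap / sum(actual.values()) if actual else 0.0
    recall = overlap / sum(expected.values()) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(trajectory_scores(
    expected_tools=["fetch_issue", "search_code", "post_comment"],
    actual_tools=["fetch_issue", "search_code", "search_code"],
))
# precision, recall, and f1 are all 2/3 (~0.67) for this example
```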