A benchmark for evaluating AI coding assistants on real GitHub issues. This project includes a curated dataset of GitHub issues with Dockerfiles for reproducible evaluation, plus tools for dataset creation and AI agent evaluation.
Get started immediately with our pre-built dataset:
# 1. Clone and install
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
pip install -r requirements.txt && pip install -e .
# 2. Set AWS credentials (for Bedrock)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2
# 3. Run evaluation on 10 Python issues
python -m cab_evaluation.cli generation-dataset \
dataset/cab_recent.jsonl \
--output results/quick_test.jsonl \
--agent-models '{"maintainer": "haiku", "user": "haiku"}' \
--language python
# 4. Judge the results
python -m cab_evaluation.cli evaluation-dataset \
results/quick_test.jsonl \
--output results/quick_eval.jsonl \
--agent-models '{"judge": "haiku"}'
# 5. View results
python -c "
import json
with open('results/quick_eval.jsonl') as f:
    for line in f:
        r = json.loads(line)
        print(f\"{r['issue_id']}: {r['verdict']}\")
"

What this does:
- Generates maintainer responses for Python issues using Claude Haiku (fast & cheap)
- Evaluates responses with a judge agent
- Outputs verdicts: `CORRECT`, `PARTIALLY_CORRECT`, `INCORRECT`, or `ERROR`
For production evaluation, use sonnet4 or opus models instead of haiku.
CodeAssistBench provides three ready-to-use datasets:
| Dataset | Issues | Languages | Description |
|---|---|---|---|
| `dataset/cab_recent_v2.jsonl` | 771 | 7 | Latest: June 2025 - Jan 2026 (with satisfaction conditions & classification) |
| `dataset/cab_recent.jsonl` | 308 | 7 | Recent issues (June 2025 - Jan 2026) |
| `dataset/cab_verified.jsonl` | 149 | 7 | Verified subset with tested Dockerfiles |
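To confirm which dataset you have and what it contains, a short sketch like the following counts records and languages per file (paths assume you run it from the repository root; files that are absent are skipped):

```python
import json
from collections import Counter
from pathlib import Path

for path in ["dataset/cab_recent_v2.jsonl", "dataset/cab_recent.jsonl", "dataset/cab_verified.jsonl"]:
    if not Path(path).exists():
        continue
    with open(path) as f:
        issues = [json.loads(line) for line in f]
    # Tally issues per language to sanity-check the table above
    languages = Counter(i.get("language", "unknown") for i in issues)
    print(f"{path}: {len(issues)} issues across {len(languages)} languages")
```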
Each issue in the dataset contains:
{
  "number": 1234,
  "title": "Bug: Memory leak in parser",
  "created_at": "2025-07-15T10:30:00Z",
  "closed_at": "2025-07-20T14:22:00Z",
  "commit_id": "abc123def456...",
  "labels": ["bug", "parser"],
  "url": "https://github.com/owner/repo/issues/1234",
  "body": "When parsing large files, memory usage grows unbounded...",
  "author": "user123",
  "comments": [
    {
      "user": "maintainer",
      "created_at": "2025-07-16T08:00:00Z",
      "body": "Thanks for reporting! Can you share the file?"
    }
  ],
  "satisfaction_conditions": [
    "Memory usage remains stable when parsing files >100MB",
    "Parser handles all edge cases mentioned in the issue",
    "No regression in parsing speed for normal files"
  ],
  "_classification": {
    "category": "Can be dockerized without any issue",
    "timestamp": "2025-04-14 01:01:54"
  },
  "dockerfile": "FROM python:3.11-slim\n...",
  "language": "python"
}

This section walks through how we generated the dataset from scratch using AWS Bedrock and Strands AI agents.
# 1. Clone and setup
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
python3 -m venv venv
source venv/bin/activate
# 2. Install dependencies
pip install -r requirements.txt
pip install -e .
# 3. Install Strands SDK (required for Dockerfile generation)
pip install strands-agents strands-agents-tools
pip install -e tools/
# 4. Set up LLM credentials (choose ONE option)
# Option A: AWS Bedrock (Claude models)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2
# Option B: OpenAI (GPT-5 models)
export OPENAI_API_KEY=your_openai_api_key
# 5. Set up GitHub token (for API access)
export GITHUB_TOKEN=your_github_personal_access_token

Collect closed issues from popular repositories. The script uses interactive prompts:
python script/get_github_issue.py
# Enter CSV path when prompted (see script/python_repos*.csv for examples)
# Choose label-based filtering (y/n)

Or use the bulk collection script:
python script/collect_1000_issues.py
# Edit the script to set: language, min_stars, date range

Output: `github_issues_<owner>_<repo>_<timestamp>.json`
[
  {
    "number": 1234,
    "title": "Bug: Memory leak in parser",
    "url": "https://github.com/owner/repo/issues/1234",
    "body": "When parsing large files...",
    "comments": [...]
  }
]

Find the commit hash at the time each issue was closed:
python script/get_github_commit.py \
--input-dir my_data/collected_issues \
--output-dir my_data/with_commits
# Or using short options:
python script/get_github_commit.py -i my_data/collected_issues -o my_data/with_commits

Arguments:
| Argument | Required | Description |
|---|---|---|
| `--input-dir`, `-i` | Yes | Directory containing JSON files with issues |
| `--output-dir`, `-o` | No | Output directory (default: `github_commits`) |
Output: Creates commit data files in the output directory.
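For reference, the underlying idea is to ask the GitHub API for the newest commit at or before each issue's `closed_at` timestamp. The sketch below is purely illustrative (not the script's actual implementation) and uses the public commits endpoint with the `until` parameter:

```python
import os
import requests

def commit_at(owner: str, repo: str, closed_at: str) -> str:
    """Return the SHA of the newest default-branch commit at or before closed_at."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits",
        params={"until": closed_at, "per_page": 1},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()[0]["sha"]

# Example: commit_at("owner", "repo", "2025-07-20T14:22:00Z")
```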
Use LLM to generate explicit criteria for issue resolution:
python script/scon_filter.py \
--input-dir my_data/collected_issues \
--output-dir my_data/with_scon
# With custom model and region:
python script/scon_filter.py \
-i my_data/collected_issues \
-o my_data/with_scon \
--model us.anthropic.claude-sonnet-4-5-20250929-v1:0 \
--region us-west-2

Arguments:
| Argument | Required | Default | Description |
|---|---|---|---|
| `--input-dir`, `-i` | Yes | - | Directory containing JSON files with issues |
| `--output-dir`, `-o` | Yes | - | Output directory for issues with satisfaction conditions |
| `--model`, `-m` | No | `claude-sonnet-4.5` | Bedrock model ID |
| `--region`, `-r` | No | `us-west-2` | AWS region for Bedrock |
Output: Adds a `satisfaction_conditions` field:
{
  "satisfaction_conditions": [
    "Memory usage remains stable when parsing files >100MB",
    "Parser handles all edge cases mentioned in the issue"
  ]
}
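Under the hood this step prompts a Bedrock model with the issue text and discussion. The sketch below only illustrates the kind of Converse API call involved; the prompt wording is a placeholder and the script's actual prompts and parsing are more involved:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

issue_text = "Title: Bug: Memory leak in parser\nBody: When parsing large files..."
prompt = (
    "Read the GitHub issue below and list the concrete conditions a response "
    "must meet for the reporter to consider the issue resolved.\n\n" + issue_text
)

# Ask the model for explicit resolution criteria
response = bedrock.converse(
    modelId="us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.0},
)
print(response["output"]["message"]["content"][0]["text"])
```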
Classify issues by whether they need a Docker environment:

python script/docker_filter.py \
--input-dir my_data/with_scon \
--output-dir my_data/classified
# With custom region:
python script/docker_filter.py \
-i my_data/with_scon \
-o my_data/classified \
--region us-east-1

Arguments:
| Argument | Required | Default | Description |
|---|---|---|---|
| `--input-dir`, `-i` | Yes | - | Directory containing JSON files with issues |
| `--output-dir`, `-o` | Yes | - | Output directory for classified issues |
| `--region`, `-r` | No | `us-west-2` | AWS region for Bedrock |
Output structure:
my_data/classified/
├── need_docker/              # Issues that need Docker environment
├── no_need_docker/           # Documentation/config changes
├── need_docker_but_cannot/   # Hardware-specific issues
├── llm_responses/            # Raw LLM responses for debugging
└── processed_issues.json     # Resume checkpoint
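To see how the classification went, a small sketch (directory names as above) can tally the JSON files that landed in each bucket:

```python
from pathlib import Path

root = Path("my_data/classified")
for bucket in ["need_docker", "no_need_docker", "need_docker_but_cannot"]:
    # Each bucket holds one JSON file per source repository/collection run
    count = len(list((root / bucket).glob("*.json")))
    print(f"{bucket}: {count} files")
```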
# Option A: Using AWS Bedrock (Claude) - default
STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
--input-dir my_data/classified/need_docker \
--languages python \
--max-attempts 3 \
--parallel 2 \
--agent-timeout 180 \
--issue-timeout 600
# Option B: Using OpenAI (GPT-5)
STRANDS_NON_INTERACTIVE=true BYPASS_TOOL_CONSENT=true \
python script/generate_dockerfile_with_strands.py \
--input-dir my_data/classified/need_docker \
--languages python \
--max-attempts 3 \
--parallel 2 \
--agent-timeout 180 \
--issue-timeout 600 \
--model-id gpt5 \
--provider openai

What happens:
- Strands agent reads the issue and repository structure
- Agent generates a Dockerfile based on repo's build system
- Docker builds the image to verify it works
- If build fails, agent iterates with error feedback
- Success: Dockerfile is saved to the issue JSON
Output: Adds a `dockerfile` field:
{
"dockerfile": "FROM python:3.11-slim\n\nWORKDIR /workspace\n\nRUN apt-get update && apt-get install -y git\n\nRUN git clone https://github.com/owner/repo.git . && \\\n git checkout abc123def456\n\nRUN pip install -r requirements.txt\n\nCMD [\"pytest\", \"tests/\"]\n"
}Combine all processed issues into a single JSONL file:
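If you want to re-verify a stored Dockerfile yourself, one approach (a sketch, assuming Docker is installed locally and the issue record has the fields shown above; file paths are hypothetical) is to write the `dockerfile` string into a temporary build context and run `docker build`:

```python
import json
import subprocess
import tempfile
from pathlib import Path

def verify_dockerfile(issue: dict) -> bool:
    """Build the issue's stored Dockerfile and report whether the build succeeds."""
    with tempfile.TemporaryDirectory() as ctx:
        # Write the Dockerfile string into an empty build context
        Path(ctx, "Dockerfile").write_text(issue["dockerfile"])
        result = subprocess.run(
            ["docker", "build", "-t", f"cab-issue-{issue['number']}", ctx],
            capture_output=True,
            text=True,
            timeout=600,
        )
    return result.returncode == 0

# Example (hypothetical path; each classified file holds a list of issues):
# issues = json.loads(Path("my_data/classified/need_docker/issues.json").read_text())
# print(verify_dockerfile(issues[0]))
```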
Combine all processed issues into a single JSONL file:

python script/convert_to_jsonl.py \
--input-dir my_data/classified/need_docker \
--output my_data/my_dataset.jsonl

Here's a complete walkthrough that runs a sample issue through the entire pipeline:
cd CodeAssistBench
# Set up credentials (AWS Bedrock + GitHub)
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2
export GITHUB_TOKEN=your_github_token

Create a directory with sample issues:
mkdir -p test_pipeline/step1_raw

Create `test_pipeline/step1_raw/test_issues.json`:
[
  {
    "number": 1234,
    "title": "How to handle async operations in Python?",
    "created_at": "2025-07-15T10:30:00Z",
    "url": "https://github.com/python/cpython/issues/1234",
    "body": "I'm trying to use async/await but get 'RuntimeWarning: coroutine was never awaited'.",
    "author": "user123",
    "comments": [
      {"user": "maintainer", "created_at": "2025-07-16T08:00:00Z", "body": "Use asyncio.run() to execute your coroutine."},
      {"user": "user123", "created_at": "2025-07-17T09:00:00Z", "body": "That worked perfectly!"}
    ]
  }
]

python3 script/scon_filter.py \
--input-dir test_pipeline/step1_raw \
--output-dir test_pipeline/step2_scon

Expected output:
Processing directory: test_pipeline/step1_raw
Found 1 JSON files
Processing conversation 1/1 (ID: 1234)
Added satisfaction conditions for conversation 1234
Saved 1 processed conversations to test_pipeline/step2_scon/test_issues.json
python3 script/docker_filter.py \
--input-dir test_pipeline/step2_scon \
--output-dir test_pipeline/step3_classified

Expected output:
Input directory: test_pipeline/step2_scon
Output directory: test_pipeline/step3_classified
Found 1 JSON files to process.
Classified issue #1234 as: Does not need build environment
--- Classification Summary ---
Total issues processed: 1
Does not need build environment: 1 issues (100.0%)
test_pipeline/
├── step1_raw/
│   └── test_issues.json              # Original issues
├── step2_scon/
│   ├── test_issues.json              # + satisfaction_conditions
│   └── test_issues_prompts_responses.json
└── step3_classified/
    ├── no_need_docker/
    │   └── test_issues.json          # + _classification
    ├── need_docker/                  # (empty for this example)
    ├── llm_responses/                # Raw LLM outputs
    └── classification_summary.json
# Check satisfaction conditions were added
cat test_pipeline/step2_scon/test_issues.json | jq '.[0].satisfaction_conditions'
# Check classification
cat test_pipeline/step3_classified/no_need_docker/test_issues.json | jq '.[0]._classification'

See `examples/` for sample outputs at each pipeline stage:
| File | Description |
|---|---|
| `examples/sample_dataset.jsonl` | Complete issues with all fields |
| `examples/sample_docker_based_issues.jsonl` | Issues requiring Docker |
| `examples/sample_non_docker_based_issues.jsonl` | Documentation/config issues |
| `examples/sample_pipeline_output.json` | Single issue showing all fields |
import json
# Load the dataset
with open('dataset/cab_recent.jsonl', 'r') as f:
    issues = [json.loads(line) for line in f]
# Filter by language
python_issues = [i for i in issues if i.get('language') == 'python']
# Get issues with Dockerfiles
dockerized = [i for i in issues if i.get('dockerfile')]
print(f"Total issues: {len(issues)}")
print(f"Python issues: {len(python_issues)}")
print(f"With Dockerfiles: {len(dockerized)}")The evaluation framework has two phases: Generation (maintainer answers issues) and Evaluation (judge scores responses).
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Dataset    │ ───▶ │  Generation  │ ───▶ │  Evaluation  │
│   (JSONL)    │      │   Workflow   │      │   Workflow   │
└──────────────┘      └──────────────┘      └──────────────┘
                             │                     │
                    Maintainer ↔ User         Judge Agent
                    Multi-round chat         Scores answers
python -m cab_evaluation.cli generation-dataset \
dataset/cab_recent.jsonl \
--output results/generation_results.jsonl \
--agent-models '{"maintainer": "sonnet4", "user": "haiku"}' \
--language python \
--resume

Arguments:
| Argument | Description |
|---|---|
| `--output`, `-o` | Output file (default: auto-generated with timestamp) |
| `--agent-models` | JSON mapping agent roles to models: `{"maintainer": "sonnet4", "user": "haiku"}` |
| `--language`, `-l` | Filter by language (`python`, `javascript`, etc.) |
| `--resume` | Skip already-processed issues |
| `--max-conversation-rounds` | Max rounds between maintainer and user (default: 2) |
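To sweep several languages with the same settings, a small wrapper can invoke the CLI repeatedly and rely on `--resume` to skip finished work. This is a sketch built only on the flags above; the language list and output paths are examples:

```python
import json
import subprocess

agent_models = json.dumps({"maintainer": "sonnet4", "user": "haiku"})

for language in ["python", "javascript", "java"]:
    # One generation run per language, writing to a per-language output file
    subprocess.run(
        [
            "python", "-m", "cab_evaluation.cli", "generation-dataset",
            "dataset/cab_recent.jsonl",
            "--output", f"results/generation_{language}.jsonl",
            "--agent-models", agent_models,
            "--language", language,
            "--resume",
        ],
        check=True,
    )
```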
python -m cab_evaluation.cli evaluation-dataset \
results/generation_results.jsonl \
--output results/evaluation_results.jsonl \
--agent-models '{"judge": "sonnet4"}' \
--resume

Arguments:
| Argument | Description |
|---|---|
| `--output`, `-o` | Output file for evaluation results |
| `--agent-models` | JSON with judge model: `{"judge": "sonnet4"}` |
| `--resume` | Skip already-evaluated issues |
| `--iterative` | Enable multi-iteration judge with repo exploration |
The judge assigns one of these verdicts:
| Verdict | Description |
|---|---|
| `CORRECT` | Response fully addresses the issue and satisfies all conditions |
| `PARTIALLY_CORRECT` | Response addresses some aspects but misses key elements |
| `INCORRECT` | Response doesn't address the issue or provides wrong information |
| `ERROR` | Processing failed (timeout, API error, etc.) |
Each result in the JSONL file contains:
{
  "issue_id": "1234",
  "question_title": "How to handle async operations?",
  "verdict": "CORRECT",
  "judgment": "The maintainer correctly identified the issue...",
  "key_issues": ["Clear explanation provided", "Code example included"],
  "alignment_score": {
    "satisfied": 3,
    "total": 3,
    "percentage": 100.0,
    "conditions": [
      {"number": 1, "satisfied": true, "description": "Explains async pattern"},
      {"number": 2, "satisfied": true, "description": "Provides working example"},
      {"number": 3, "satisfied": true, "description": "Addresses RuntimeWarning"}
    ]
  },
  "generation_metadata": {
    "user_satisfied": true,
    "total_conversation_rounds": 2
  }
}
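Because each result carries per-condition judgments, you can also see which satisfaction conditions fail most often. A short sketch over the schema above:

```python
import json
from collections import Counter

with open('results/evaluation_results.jsonl') as f:
    results = [json.loads(line) for line in f]

# Count how often each satisfaction condition was judged unmet
unmet = Counter(
    cond["description"]
    for r in results
    if r.get("alignment_score")
    for cond in r["alignment_score"].get("conditions", [])
    if not cond["satisfied"]
)
for description, count in unmet.most_common(10):
    print(f"{count:3d}  {description}")
```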
import json
from collections import Counter
# Load evaluation results
with open('results/evaluation_results.jsonl', 'r') as f:
    results = [json.loads(line) for line in f]
# Count verdicts
verdicts = Counter(r['verdict'] for r in results)
print(f"Total: {len(results)}")
print(f"CORRECT: {verdicts['CORRECT']} ({verdicts['CORRECT']/len(results)*100:.1f}%)")
print(f"PARTIALLY_CORRECT: {verdicts['PARTIALLY_CORRECT']} ({verdicts['PARTIALLY_CORRECT']/len(results)*100:.1f}%)")
print(f"INCORRECT: {verdicts['INCORRECT']} ({verdicts['INCORRECT']/len(results)*100:.1f}%)")
print(f"ERROR: {verdicts.get('ERROR', 0)}")
# Average alignment score
valid_results = [r for r in results if r.get('alignment_score')]
avg_alignment = sum(r['alignment_score']['percentage'] for r in valid_results) / len(valid_results)
print(f"Average alignment: {avg_alignment:.1f}%")Available model shortcuts for --agent-models:
| Alias | Full Model ID |
|---|---|
| `sonnet4` | `us.anthropic.claude-sonnet-4-20250514-v1:0` |
| `sonnet45` | `us.anthropic.claude-sonnet-4-5-20250929-v1:0` |
| `haiku` | `us.anthropic.claude-3-5-haiku-20241022-v1:0` |
| `opus` | `us.anthropic.claude-opus-4-20250514-v1:0` |
See examples/USAGE_GUIDE.md for more detailed instructions.
CodeAssistBench/
├── dataset/                        # Final datasets
│   ├── cab_recent.jsonl            # 308 recent issues
│   ├── cab_verified.jsonl          # 149 verified issues
│   └── recent/                     # Additional samples
├── src/cab_evaluation/             # Evaluation framework
│   ├── agents/                     # Agent implementations
│   ├── core/                       # Core models and config
│   ├── prompts/                    # Prompt templates
│   ├── utils/                      # Utilities
│   └── workflows/                  # Evaluation workflows
├── script/                         # Data collection scripts
│   ├── get_github_issue.py         # Step 1: Issue collection
│   ├── get_github_commit.py        # Step 2: Commit ID lookup
│   ├── scon_filter.py              # Step 3: Satisfaction conditions
│   ├── docker_filter.py            # Step 4: Classification
│   └── generate_dockerfile_with_strands.py  # Step 5: Dockerfiles
├── tools/                          # Custom Strands tools (required)
├── examples/                       # Sample data and guides
│   ├── USAGE_GUIDE.md              # Detailed usage guide
│   └── sample_*.jsonl              # Sample datasets
├── prompts/                        # Prompt templates
└── docs/                           # Documentation
    └── DATA_PIPELINE.md            # Detailed pipeline docs
# Clone the repository
git clone https://github.com/your-org/CodeAssistBench.git
cd CodeAssistBench
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install in development mode
pip install -e .
# Install Strands SDK (REQUIRED for Dockerfile generation)
pip install strands-agents strands-agents-tools
pip install -e tools/

Set AWS credentials (for Bedrock):

export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-west-2

- Usage Guide - Detailed evaluation instructions
- Data Pipeline - Complete pipeline documentation
- Development - Contributing and development setup
- Automated Dockerfile Generation: Uses Strands AI agents with AWS Bedrock
- Multi-language Support: Python, JavaScript, TypeScript, Java, Go, C, C++
- Satisfaction Conditions: LLM-generated criteria for issue resolution
- Docker-based Evaluation: Reproducible evaluation environment
- Multiple Agent Frameworks: Supports Strands, OpenHands, and Q-CLI
If you use CodeAssistBench in your research, please cite our paper:
@inproceedings{
kim2025codeassistbench,
title={CodeAssistBench ({CAB}): Dataset \& Benchmarking for Multi-turn Chat-Based Code Assistance},
author={Myeongsoo Kim and Shweta Garg and Baishakhi Ray and Varun Kumar and Anoop Deoras},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2025},
url={https://openreview.net/forum?id=2R6y4Ku9kG}
}

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
The underlying GitHub issues are subject to their respective repository licenses.
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
| Variable | Description |
|---|---|
| `STRANDS_NON_INTERACTIVE=true` | Required. Disables interactive prompts |
| `BYPASS_TOOL_CONSENT=true` | Required. Bypasses tool confirmation |
| Argument | Default | Description |
|---|---|---|
| `--input-dir`, `-i` | (required) | Directory with classified issues |
| `--output-dir`, `-o` | `logs/dockerfile_generation_strands` | Output directory |
| `--languages` | (all) | Specific languages to process |
| `--max-attempts` | `10` | Max retry attempts per issue |
| `--docker-timeout` | `600` | Docker build timeout (seconds) |
| `--agent-timeout` | `300` | Agent attempt timeout (seconds) |
| `--issue-timeout` | `1800` | Total timeout per issue (seconds) |
| `--parallel`, `-p` | `1` | Parallel processing count |
| `--model-id` | `claude-sonnet-4-5` | AWS Bedrock model ID |