A full-featured Papers with Code clone with 575,626 papers, 3,957 tasks, 11,736 leaderboards, and 218,852 implementations.
- 📚 575,626 Machine Learning Papers - Complete paper database with search and browsing
- 🎯 3,957 Machine Learning Tasks - Categorized by research areas with hierarchical structure
- 🏆 11,736 Leaderboards - SOTA results and model rankings for each task
- ⭐ Real-time GitHub Stars - Automatically fetch repository star counts
- 💻 218,852 Code Implementation Links - Papers linked to their code implementations
- 💬 Comment System - Nested comments with reply support
- 👥 User Following - Follow other researchers
- 📁 Categorized Collections - Custom collections to manage papers
- 📊 Reading History - Automatically track browsing history
- 📰 Activity Feed - View activities from followed users
- 🤖 AI Assistant - ChatGPT integration for paper analysis
- 🧠 Smart Insights - AI-generated paper insights
- 💡 Concept Explanation - AI explains research concepts
- ✨ LLM Task Classification - Intelligent paper categorization using GPT/Claude/Gemini
- 🔗 Smart GitHub Matching - AI-powered repository discovery for papers
- 🔐 JWT Authentication - Secure user authentication
- 👤 User Profiles - Personal pages and statistics
- 🎨 Modern UI - Responsive design with beautiful interface
- FastAPI - High-performance Python web framework
- SQLite - Lightweight database (ships with the complete 1.2GB dataset)
- Python 3.8+ - Backend development language
- React 18 - Frontend framework
- TypeScript - Type-safe JavaScript
- Tailwind CSS - Utility-first CSS framework
- Vite - Fast frontend build tool
- KaTeX - LaTeX math formula rendering
For production servers, use PM2 for process management and automatic restarts:
# Install PM2 globally (if not installed)
npm install -g pm2
# Start all services with PM2
./start-pm2.sh
# Or manually with:
pm2 start ecosystem.config.js
PM2 Features:
- ✅ Automatic restart on crash
- ✅ Log management and rotation
- ✅ Memory limit auto-restart
- ✅ Daily arXiv updates (2 AM cron job)
- ✅ System startup persistence
PM2 Management Commands:
pm2 status # View service status
pm2 logs # View all logs
pm2 logs papercode-backend # View backend logs
pm2 logs papercode-frontend # View frontend logs
pm2 restart all # Restart all services
pm2 stop all # Stop all services
pm2 monit # Real-time monitoring
pm2 startup # Setup auto-start on reboot
pm2 save # Save current process list
Large Files: The project contains >5GB of data files that are NOT included in the Git repository.
Run the setup script to check data files:
./setup_data.sh
Backend:
cd backend
pip install -r requirements.txt
Frontend:
cd frontend
npm installData:
./setup_data.sh
The system now includes AI-powered features for task classification and GitHub repository matching.
Setup API Keys:
cd backend
cp .env.example .env
# Edit .env with your API keys
Required Environment Variables:
# Choose your preferred LLM provider
PREFERRED_LLM_PROVIDER=gemini # Options: openai, anthropic, gemini
# Add at least one API key:
OPENAI_API_KEY=your_openai_api_key # Optional
ANTHROPIC_API_KEY=your_anthropic_api_key # Optional
GOOGLE_API_KEY=your_google_api_key # Recommended (free tier available)
# Optional: GitHub token for enhanced repository search
GITHUB_TOKEN=your_github_personal_token # Increases rate limits
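For reference, here is a minimal sketch of how a backend might resolve these variables into an active provider. It is illustrative only: the function name and fallback order are assumptions, not the actual logic in the services/ modules.

```python
import os

def pick_llm_provider() -> str:
    """Return the configured provider, falling back to whichever API key is set."""
    preferred = os.getenv("PREFERRED_LLM_PROVIDER", "gemini").lower()
    keys = {
        "openai": os.getenv("OPENAI_API_KEY"),
        "anthropic": os.getenv("ANTHROPIC_API_KEY"),
        "gemini": os.getenv("GOOGLE_API_KEY"),
    }
    if keys.get(preferred):
        return preferred
    # Fall back to the first provider that actually has a key configured
    for name, key in keys.items():
        if key:
            return name
    raise RuntimeError("No LLM API key configured; set at least one in backend/.env")
```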
Update Database Schema:
cd backend
sqlite3 paperswithcode.db < scripts/add_llm_columns.sql
Benefits:
- 🎯 Better Task Classification: AI analyzes paper content vs simple keyword matching
- 🔍 Smarter GitHub Discovery: Multi-strategy search with confidence scoring
- ✨ Visual Indicators: UI shows AI-enhanced content with confidence levels
- 📈 Improved Accuracy: Higher precision in categorization and code linking
Use the startup script to launch everything automatically (includes arXiv paper updates):
./start.sh
This script will:
- ✅ Check prerequisites (Python, Node.js)
- ✅ Check and free up required ports (8003, 3000)
- ✅ Initialize database if needed
- 🆕 Update latest arXiv papers to data/newPaper.json
- ✅ Start backend server on port 8003
- ✅ Start frontend server on port 3000
Backend Server:
cd backend
python full_api_server.py
Visit http://localhost:8003
Frontend Server:
cd frontend
npm start
Visit http://localhost:3000
Username: demo
Password: Demo123456
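With the demo account you can exercise the JWT flow from a script. The sketch below assumes a conventional login route; the endpoint path and field names are guesses, not confirmed by the codebase, so check the FastAPI routes (or http://localhost:8003/docs) for the real ones.

```python
import requests

# Hypothetical login flow -- the endpoint path and payload field names are
# assumptions; adjust them to match the actual authentication route.
resp = requests.post(
    "http://localhost:8003/api/v1/auth/login",
    json={"username": "demo", "password": "Demo123456"},
)
resp.raise_for_status()
token = resp.json().get("access_token")
headers = {"Authorization": f"Bearer {token}"}  # attach to subsequent requests
```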
paperswithcode/
├── backend/
│ ├── full_api_server.py # FastAPI main server
│ ├── paperswithcode.db # SQLite database (1.2GB)
│ ├── requirements.txt # Python dependencies
│ ├── services/ # LLM services
│ │ ├── llm_task_classifier.py # AI task classification
│ │ └── llm_github_matcher.py # AI GitHub repository matching
│ └── scripts/ # Data import and processing scripts
│ ├── daily_arxiv_update.py # ArXiv paper fetching with LLM
│ └── enhance_existing_papers.py # Batch enhance existing papers
│
├── frontend/
│ ├── src/
│ │ ├── pages/ # Page components
│ │ │ ├── HomePage.tsx # Home page
│ │ │ ├── PapersPage.tsx # Papers list
│ │ │ └── TasksPage.tsx # Tasks list
│ │ ├── components/ # Common components
│ │ └── services/ # API services
│ ├── package.json # Node dependencies
│ └── vite.config.ts # Vite configuration
│
└── data/ # Raw data files
├── papers.json.gz # Paper data
└── evaluation-tables.json # Leaderboard data
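If you want to inspect the raw paper dump directly, here is a quick sketch. It assumes data/papers.json.gz is a single gzip-compressed JSON array; verify that against your copy, since the exact schema is not documented here.

```python
import gzip
import json

# Peek at the raw paper dump. Field names inside each record are not
# documented here; print a record to see the real schema.
with gzip.open("data/papers.json.gz", "rt", encoding="utf-8") as f:
    papers = json.load(f)

print(f"{len(papers)} papers loaded")
print(papers[0])  # first record
```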
The project includes an arXiv paper fetching system that can automatically update with the latest ML papers.
Every time you run ./start.sh, it automatically fetches the latest papers and saves them to data/newPaper.json.
Option 1: Using Cron (Recommended)
cd backend/scripts
./setup_daily_cron.sh
This sets up automatic updates every day at 2:00 AM.
Option 2: Using systemd timer
cd backend/scripts
./setup_systemd_timer.sh
To manually fetch the latest papers:
cd backend/scripts
python daily_arxiv_update.py --max-results 50 --days-back 7
Parameters:
- --max-results: Number of papers to fetch per category (default: 50)
- --days-back: How many days back to search (default: 1)
- --config: Custom configuration file (default: arxiv_config.yaml)
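For context, daily_arxiv_update.py presumably builds on the public arXiv API. Below is a standalone sketch of an equivalent query; the category and limits are examples, not the script's actual configuration.

```python
import urllib.parse
import urllib.request

# Query the public arXiv API (returns an Atom feed). This only illustrates the
# kind of request the update script makes; it does not use the project's config.
params = urllib.parse.urlencode({
    "search_query": "cat:cs.LG",          # one category; the script covers several
    "start": 0,
    "max_results": 50,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
})
with urllib.request.urlopen(f"http://export.arxiv.org/api/query?{params}") as resp:
    feed = resp.read().decode("utf-8")

print(feed[:500])  # raw Atom XML; parse with feedparser or xml.etree as needed
```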
For papers already in your database that lack proper tasks or GitHub links:
cd backend/scripts
# Preview what would be enhanced
python enhance_existing_papers.py --dry-run --max-papers 10
# Actually enhance papers with AI
python enhance_existing_papers.py --max-papers 50 --min-confidence 0.3
This will:
- 🏷️ Add AI-classified tasks to papers missing specific tasks
- 🔗 Find GitHub repositories for papers without code links
- 📊 Provide confidence scores and reasoning for each enhancement
- ✨ Update UI indicators to show AI-enhanced content
Edit backend/scripts/arxiv_config.yaml to customize:
- Research areas to track
- Search keywords and filters
- Number of results per area
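To see what is currently configured, you can load the file programmatically. This is only a sketch: the key names are whatever the YAML actually defines, and none are guaranteed here.

```python
import yaml  # pip install pyyaml

# Load the fetcher configuration and list its top-level keys.
with open("backend/scripts/arxiv_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(config)  # inspect research areas, keywords, per-area limits, etc.
```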
The fetched papers include:
- Title, abstract, authors
- ArXiv ID and PDF links
- 🤖 AI-powered task classification with confidence scores
- 🔗 Intelligent GitHub repository matching using LLM analysis
- Publication date and venue extraction
- GET /api/v1/papers - Get paper list (supports search and pagination)
- GET /api/v1/papers/{id} - Get paper details
- GET /api/v1/github/stars - Get GitHub repository star count
- GET /api/v1/tasks - Get task list (supports filtering by area)
- GET /api/v1/tasks/{id} - Get task details
- GET /api/v1/areas - Get research areas list
- GET /api/v1/leaderboards/{id} - Get leaderboard for specific task
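A quick client-side sketch of calling these endpoints with requests. The query-parameter names and response shapes are assumptions; FastAPI serves interactive docs at /docs by default, which show the authoritative schema.

```python
import requests

BASE = "http://localhost:8003/api/v1"

# Parameter names ("q", "page") are assumptions -- check
# http://localhost:8003/docs for the real search/pagination parameters.
resp = requests.get(f"{BASE}/papers", params={"q": "diffusion", "page": 1})
resp.raise_for_status()
print(resp.json())

# Tasks filtered by research area (the "area" parameter is likewise an assumption)
tasks = requests.get(f"{BASE}/tasks", params={"area": "computer-vision"})
print(tasks.status_code)
```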
The project requires these large files (not in Git):
| File | Size | Description |
|---|---|---|
| paperswithcode.db | ~1.3GB | Main database |
| paperswithcode_full.db | ~1.1GB | Full database |
| papers-with-abstracts.json | ~2.2GB | Paper data |
| evaluation-tables.json | ~252MB | Leaderboard data |
| links-between-papers-and-code.json | ~155MB | Code links |
- Git LFS (Recommended for team collaboration)
  git lfs install
  git lfs track "*.db"
  git lfs track "data/*.json"
- Cloud Storage
  - Google Drive / Dropbox
  - AWS S3 / Azure Blob
  - GitHub Releases (max 2GB per file)
- Data Download
  - Raw data: https://paperswithcode.com/datasets
  - Contact maintainer for preprocessed database
- Total Papers: 575,626
- Total Tasks: 3,957
- Implementations: 218,852
- Leaderboards: 11,736
- Research Areas: 16 main categories
- Computer Vision (1,487 tasks)
- Natural Language Processing (987 tasks)
- Reinforcement Learning (456 tasks)
- Medical (387 tasks)
- More...
docker-compose up --build
The application will start on the following ports:
- Frontend: http://localhost:3000
- Backend: http://localhost:8000
- Python 3.8+
- Node.js 16+
- At least 2GB available disk space (for database)
- OpenAI API key (for GPT-based classification)
- Anthropic API key (for Claude-based classification)
- Google API key (for Gemini-based classification) - Recommended
- GitHub Personal Access Token (for enhanced repository search)
IMPORTANT: All sensitive variables have been replaced with placeholders. You must configure these before deployment:
- Database Configuration (docker-compose.yml):
  POSTGRES_USER=your_db_username # Line 7
  POSTGRES_PASSWORD=your_db_password # Line 8
- JWT Secret (docker-compose.yml, ecosystem.config.js):
  SECRET_KEY=your_secret_key # Must be a secure random string
- Server IP Address (Replace in all files):
  docker-compose.yml line 92: VITE_API_URL=http://your_server_ip:8003
  frontend/src/config/api.ts line 4: API_BASE_URL=http://your_server_ip:8003
  frontend/vite.config.ts line 11: allowedHosts=[..., 'your_server_ip']
  backend/app/core/config.py line 25: CORS_ORIGINS=[..., 'http://your_server_ip', ...]
- Google OAuth Configuration (frontend/src/config/api.ts):
  VITE_GOOGLE_CLIENT_ID=your_google_client_id # Line 16
# Create backend/.env file with:
OPENAI_API_KEY=your_openai_api_key # For GPT-based features
ANTHROPIC_API_KEY=your_anthropic_api_key # For Claude-based features
GOOGLE_API_KEY=your_google_api_key # For Gemini-based features
MINIMAX_API_KEY=your_minimax_api_key # For MiniMax-based features
MINIMAX_GROUP_ID=your_minimax_group_id # MiniMax group ID
GITHUB_TOKEN=your_github_personal_token # For enhanced repository search
All files (images, papers data) are now stored locally in the data/ directory:
- Images: data/images/ (automatically created)
- Papers: data/ directory (arxiv_papers_*.json files)
- Never commit real API keys or passwords to version control
- Use strong, unique passwords for database users
- Generate a secure JWT secret (minimum 32 characters; see the snippet after this list)
- Restrict CORS origins to your actual domain in production
- Use environment variables for all sensitive configuration in production
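One simple way to produce such a secret: any cryptographically secure random string of sufficient length works.

```python
import secrets

# Prints a URL-safe random string well over 32 characters, suitable for SECRET_KEY.
print(secrets.token_urlsafe(48))
```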
MIT License
Issues and Pull Requests are welcome!