A full-featured Papers with Code clone with 575,626 papers, 3,957 tasks, 11,736 leaderboards, and 218,852 implementations.
- 📚 575,626 Machine Learning Papers - Complete paper database with search and browsing
- 🎯 3,957 Machine Learning Tasks - Categorized by research areas with hierarchical structure
- 🏆 11,736 Leaderboards - SOTA results and model rankings for each task
- ⭐ Real-time GitHub Stars - Automatically fetch repository star counts
- 💻 218,852 Code Implementation Links - Papers linked to their code implementations
- 💬 Comment System - Nested comments with reply support
- 👥 User Following - Follow other researchers
- 📁 Categorized Collections - Custom collections to manage papers
- 📊 Reading History - Automatically track browsing history
- 📰 Activity Feed - View activities from followed users
- 🤖 AI Assistant - ChatGPT integration for paper analysis
- 🧠 Smart Insights - AI-generated paper insights
- 💡 Concept Explanation - AI explains research concepts
- ✨ LLM Task Classification - Intelligent paper categorization using GPT/Claude/Gemini
- 🔗 Smart GitHub Matching - AI-powered repository discovery for papers
- 🔐 JWT Authentication - Secure user authentication
- 👤 User Profiles - Personal pages and statistics
- 🎨 Modern UI - Responsive design with beautiful interface
- FastAPI - High-performance Python web framework
- SQLite - Lightweight database (ships with the complete 1.2GB dataset)
- Python 3.8+ - Backend development language
- React 18 - Frontend framework
- TypeScript - Type-safe JavaScript
- Tailwind CSS - Utility-first CSS framework
- Vite - Fast frontend build tool
- KaTeX - LaTeX math formula rendering
For production servers, use PM2 for process management and automatic restarts:
# Install PM2 globally (if not installed)
npm install -g pm2
# Start all services with PM2
./start-pm2.sh
# Or manually with:
pm2 start ecosystem.config.js
PM2 Features:
- ✅ Automatic restart on crash
- ✅ Log management and rotation
- ✅ Memory limit auto-restart
- ✅ Daily arXiv updates (2 AM cron job)
- ✅ System startup persistence
PM2 Management Commands:
pm2 status # View service status
pm2 logs # View all logs
pm2 logs papercode-backend # View backend logs
pm2 logs papercode-frontend # View frontend logs
pm2 restart all # Restart all services
pm2 stop all # Stop all services
pm2 monit # Real-time monitoring
pm2 startup # Setup auto-start on reboot
pm2 save # Save current process list
Large Files: The project contains >5GB of data files that are NOT included in the Git repository.
Run the setup script to check data files:
./setup_data.sh
Backend:
cd backend
pip install -r requirements.txt
Frontend:
cd frontend
npm installData:
./setup_data.sh
The system now includes AI-powered features for task classification and GitHub repository matching.
Setup API Keys:
cd backend
cp .env.example .env
# Edit .env with your API keys
Required Environment Variables:
# Choose your preferred LLM provider
PREFERRED_LLM_PROVIDER=gemini # Options: openai, anthropic, gemini
# Add at least one API key:
OPENAI_API_KEY=your_openai_api_key # Optional
ANTHROPIC_API_KEY=your_anthropic_api_key # Optional
GOOGLE_API_KEY=your_google_api_key # Recommended (free tier available)
# Optional: GitHub token for enhanced repository search
GITHUB_TOKEN=your_github_personal_token # Increases rate limits
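For reference, here is a minimal sketch of how a backend might resolve these variables into an active provider. It is illustrative only: the function name and fallback order are assumptions, not the actual logic in the services/ modules.

```python
import os

def pick_llm_provider() -> str:
    """Return the configured provider, falling back to whichever API key is set."""
    preferred = os.getenv("PREFERRED_LLM_PROVIDER", "gemini").lower()
    keys = {
        "openai": os.getenv("OPENAI_API_KEY"),
        "anthropic": os.getenv("ANTHROPIC_API_KEY"),
        "gemini": os.getenv("GOOGLE_API_KEY"),
    }
    if keys.get(preferred):
        return preferred
    # Fall back to the first provider that actually has a key configured
    for name, key in keys.items():
        if key:
            return name
    raise RuntimeError("No LLM API key configured; set at least one in backend/.env")
```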
Update Database Schema:
cd backend
sqlite3 paperswithcode.db < scripts/add_llm_columns.sql
Benefits:
- 🎯 Better Task Classification: AI analyzes paper content vs simple keyword matching
- 🔍 Smarter GitHub Discovery: Multi-strategy search with confidence scoring
- ✨ Visual Indicators: UI shows AI-enhanced content with confidence levels
- 📈 Improved Accuracy: Higher precision in categorization and code linking
Use the startup script to launch everything automatically (includes arXiv paper updates):
./start.sh
This script will:
- ✅ Check prerequisites (Python, Node.js)
- ✅ Check and free up required ports (8003, 3000)
- ✅ Initialize database if needed
- 🆕 Update latest arXiv papers to data/newPaper.json
- ✅ Start backend server on port 8003
- ✅ Start frontend server on port 3000
Backend Server:
cd backend
python full_api_server.py
Visit http://localhost:8003
Frontend Server:
cd frontend
npm start
Visit http://localhost:3000
Username: demo
Password: Demo123456
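With the demo account you can exercise the JWT flow from a script. The sketch below assumes a conventional login route; the endpoint path and field names are guesses, not confirmed by the codebase, so check the FastAPI routes (or http://localhost:8003/docs) for the real ones.

```python
import requests

# Hypothetical login flow -- the endpoint path and payload field names are
# assumptions; adjust them to match the actual authentication route.
resp = requests.post(
    "http://localhost:8003/api/v1/auth/login",
    json={"username": "demo", "password": "Demo123456"},
)
resp.raise_for_status()
token = resp.json().get("access_token")
headers = {"Authorization": f"Bearer {token}"}  # attach to subsequent requests
```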
paperswithcode/
├── backend/
│ ├── full_api_server.py # FastAPI main server
│ ├── paperswithcode.db # SQLite database (1.2GB)
│ ├── requirements.txt # Python dependencies
│ ├── services/ # LLM services
│ │ ├── llm_task_classifier.py # AI task classification
│ │ └── llm_github_matcher.py # AI GitHub repository matching
│ └── scripts/ # Data import and processing scripts
│ ├── daily_arxiv_update.py # ArXiv paper fetching with LLM
│ └── enhance_existing_papers.py # Batch enhance existing papers
│
├── frontend/
│ ├── src/
│ │ ├── pages/ # Page components
│ │ │ ├── HomePage.tsx # Home page
│ │ │ ├── PapersPage.tsx # Papers list
│ │ │ └── TasksPage.tsx # Tasks list
│ │ ├── components/ # Common components
│ │ └── services/ # API services
│ ├── package.json # Node dependencies
│ └── vite.config.ts # Vite configuration
│
└── data/ # Raw data files
├── papers.json.gz # Paper data
└── evaluation-tables.json # Leaderboard data
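If you want to inspect the raw paper dump directly, here is a quick sketch. It assumes data/papers.json.gz is a single gzip-compressed JSON array; verify that against your copy, since the exact schema is not documented here.

```python
import gzip
import json

# Peek at the raw paper dump. Field names inside each record are not
# documented here; print a record to see the real schema.
with gzip.open("data/papers.json.gz", "rt", encoding="utf-8") as f:
    papers = json.load(f)

print(f"{len(papers)} papers loaded")
print(papers[0])  # first record
```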
The project includes an arXiv paper fetching system that can automatically update with the latest ML papers.
Every time you run ./start.sh, it automatically fetches the latest papers and saves them to data/newPaper.json.
Option 1: Using Cron (Recommended)
cd backend/scripts
./setup_daily_cron.sh
This sets up automatic updates every day at 2:00 AM.
Option 2: Using systemd timer
cd backend/scripts
./setup_systemd_timer.sh
To manually fetch the latest papers:
cd backend/scripts
python daily_arxiv_update.py --max-results 50 --days-back 7
Parameters:
- --max-results: Number of papers to fetch per category (default: 50)
- --days-back: How many days back to search (default: 1)
- --config: Custom configuration file (default: arxiv_config.yaml)
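For context, daily_arxiv_update.py presumably builds on the public arXiv API. Below is a standalone sketch of an equivalent query; the category and limits are examples, not the script's actual configuration.

```python
import urllib.parse
import urllib.request

# Query the public arXiv API (returns an Atom feed). This only illustrates the
# kind of request the update script makes; it does not use the project's config.
params = urllib.parse.urlencode({
    "search_query": "cat:cs.LG",          # one category; the script covers several
    "start": 0,
    "max_results": 50,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
})
with urllib.request.urlopen(f"http://export.arxiv.org/api/query?{params}") as resp:
    feed = resp.read().decode("utf-8")

print(feed[:500])  # raw Atom XML; parse with feedparser or xml.etree as needed
```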
For papers already in your database that lack proper tasks or GitHub links:
cd backend/scripts
# Preview what would be enhanced
python enhance_existing_papers.py --dry-run --max-papers 10
# Actually enhance papers with AI
python enhance_existing_papers.py --max-papers 50 --min-confidence 0.3
This will:
- 🏷️ Add AI-classified tasks to papers missing specific tasks
- 🔗 Find GitHub repositories for papers without code links
- 📊 Provide confidence scores and reasoning for each enhancement
- ✨ Update UI indicators to show AI-enhanced content
Edit backend/scripts/arxiv_config.yaml to customize:
- Research areas to track
- Search keywords and filters
- Number of results per area
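To see what is currently configured, you can load the file programmatically. This is only a sketch: the key names are whatever the YAML actually defines, and none are guaranteed here.

```python
import yaml  # pip install pyyaml

# Load the fetcher configuration and list its top-level keys.
with open("backend/scripts/arxiv_config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

print(config)  # inspect research areas, keywords, per-area limits, etc.
```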
The fetched papers include:
- Title, abstract, authors
- ArXiv ID and PDF links
- 🤖 AI-powered task classification with confidence scores
- 🔗 Intelligent GitHub repository matching using LLM analysis
- Publication date and venue extraction
- GET /api/v1/papers - Get paper list (supports search and pagination)
- GET /api/v1/papers/{id} - Get paper details
- GET /api/v1/github/stars - Get GitHub repository star count
- GET /api/v1/tasks - Get task list (supports filtering by area)
- GET /api/v1/tasks/{id} - Get task details
- GET /api/v1/areas - Get research areas list
- GET /api/v1/leaderboards/{id} - Get leaderboard for specific task
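A quick client-side sketch of calling these endpoints with requests. The query-parameter names and response shapes are assumptions; FastAPI serves interactive docs at /docs by default, which show the authoritative schema.

```python
import requests

BASE = "http://localhost:8003/api/v1"

# Parameter names ("q", "page") are assumptions -- check
# http://localhost:8003/docs for the real search/pagination parameters.
resp = requests.get(f"{BASE}/papers", params={"q": "diffusion", "page": 1})
resp.raise_for_status()
print(resp.json())

# Tasks filtered by research area (the "area" parameter is likewise an assumption)
tasks = requests.get(f"{BASE}/tasks", params={"area": "computer-vision"})
print(tasks.status_code)
```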
The project requires these large files (not in Git):
| File | Size | Description |
|---|---|---|
| paperswithcode.db | ~1.3GB | Main database |
| paperswithcode_full.db | ~1.1GB | Full database |
| papers-with-abstracts.json | ~2.2GB | Paper data |
| evaluation-tables.json | ~252MB | Leaderboard data |
| links-between-papers-and-code.json | ~155MB | Code links |
- Git LFS (Recommended for team collaboration)
  git lfs install
  git lfs track "*.db"
  git lfs track "data/*.json"
- Cloud Storage
  - Google Drive / Dropbox
  - AWS S3 / Azure Blob
  - GitHub Releases (max 2GB per file)
- Data Download
  - Raw data: https://paperswithcode.com/datasets
  - Contact maintainer for preprocessed database
- Total Papers: 575,626
- Total Tasks: 3,957
- Implementations: 218,852
- Leaderboards: 11,736
- Research Areas: 16 main categories
- Computer Vision (1,487 tasks)
- Natural Language Processing (987 tasks)
- Reinforcement Learning (456 tasks)
- Medical (387 tasks)
- More...
docker-compose up --build
The application will start on the following ports:
- Frontend: http://localhost:3000
- Backend: http://localhost:8000
- Python 3.8+
- Node.js 16+
- At least 2GB available disk space (for database)
- OpenAI API key (for GPT-based classification)
- Anthropic API key (for Claude-based classification)
- Google API key (for Gemini-based classification) - Recommended
- GitHub Personal Access Token (for enhanced repository search)
IMPORTANT: All sensitive variables have been replaced with placeholders. You must configure these before deployment:
- Database Configuration (docker-compose.yml):
  POSTGRES_USER=your_db_username # Line 7
  POSTGRES_PASSWORD=your_db_password # Line 8
- JWT Secret (docker-compose.yml, ecosystem.config.js):
  SECRET_KEY=your_secret_key # Must be a secure random string
- Server IP Address (Replace in all files):
  docker-compose.yml line 92: VITE_API_URL=http://your_server_ip:8003
  frontend/src/config/api.ts line 4: API_BASE_URL=http://your_server_ip:8003
  frontend/vite.config.ts line 11: allowedHosts=[..., 'your_server_ip']
  backend/app/core/config.py line 25: CORS_ORIGINS=[..., 'http://your_server_ip', ...]
- Google OAuth Configuration (frontend/src/config/api.ts):
  VITE_GOOGLE_CLIENT_ID=your_google_client_id # Line 16
# Create backend/.env file with:
OPENAI_API_KEY=your_openai_api_key # For GPT-based features
ANTHROPIC_API_KEY=your_anthropic_api_key # For Claude-based features
GOOGLE_API_KEY=your_google_api_key # For Gemini-based features
MINIMAX_API_KEY=your_minimax_api_key # For MiniMax-based features
MINIMAX_GROUP_ID=your_minimax_group_id # MiniMax group ID
GITHUB_TOKEN=your_github_personal_token # For enhanced repository search
All files (images, papers data) are now stored locally in the data/ directory:
- Images: data/images/ (automatically created)
- Papers: data/ directory (arxiv_papers_*.json files)
- Never commit real API keys or passwords to version control
- Use strong, unique passwords for database users
- Generate a secure JWT secret (minimum 32 characters; see the snippet after this list)
- Restrict CORS origins to your actual domain in production
- Use environment variables for all sensitive configuration in production
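One simple way to produce such a secret: any cryptographically secure random string of sufficient length works.

```python
import secrets

# Prints a URL-safe random string well over 32 characters, suitable for SECRET_KEY.
print(secrets.token_urlsafe(48))
```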
MIT License
Issues and Pull Requests are welcome!