
Collect 56k+ papers with code, automatically updated every day. http://scholarwiki.ai


Pi3AI/ScholarWiki


📚 PaperCode - Papers with Code Clone

A full-featured Papers with Code clone with 575,626 papers, 3,957 tasks, 11,736 leaderboards, and 218,852 implementations.

✨ Features

Core Features

  • 📚 575,626 Machine Learning Papers - Complete paper database with search and browsing
  • 🎯 3,957 Machine Learning Tasks - Categorized by research areas with hierarchical structure
  • 🏆 11,736 Leaderboard System - SOTA results and model rankings for each task
  • Real-time GitHub Stars - Automatically fetch repository star counts
  • 💻 218,852 Code Implementation Links - Links from papers to their code implementations

Social Features

  • 💬 Comment System - Nested comments with reply support
  • 👥 User Following - Follow other researchers
  • 📁 Categorized Collections - Custom collections to manage papers
  • 📊 Reading History - Automatically track browsing history
  • 📰 Activity Feed - View activities from followed users

AI Features

  • 🤖 AI Assistant - ChatGPT integration for paper analysis
  • 🧠 Smart Insights - AI-generated paper insights
  • 💡 Concept Explanation - AI explains research concepts
  • LLM Task Classification - Intelligent paper categorization using GPT/Claude/Gemini
  • 🔗 Smart GitHub Matching - AI-powered repository discovery for papers

User System

  • 🔐 JWT Authentication - Secure user authentication
  • 👤 User Profiles - Personal pages and statistics
  • 🎨 Modern UI - Responsive design with beautiful interface

Tech Stack

Backend

  • FastAPI - High-performance Python web framework
  • SQLite - Lightweight database (includes 1.2GB complete data)
  • Python 3.8+ - Backend development language

Frontend

  • React 18 - Frontend framework
  • TypeScript - Type-safe JavaScript
  • Tailwind CSS - Utility-first CSS framework
  • Vite - Fast frontend build tool
  • KaTeX - LaTeX math formula rendering

🚀 Quick Start

🌟 Production Deployment with PM2 (Recommended for Servers)

For production servers, use PM2 for process management and automatic restarts:

# Install PM2 globally (if not installed)
npm install -g pm2

# Start all services with PM2
./start-pm2.sh

# Or manually with:
pm2 start ecosystem.config.js

PM2 Features:

  • ✅ Automatic restart on crash
  • ✅ Log management and rotation
  • ✅ Memory limit auto-restart
  • ✅ Daily arXiv updates (2 AM cron job)
  • ✅ System startup persistence

PM2 Management Commands:

pm2 status              # View service status
pm2 logs                # View all logs
pm2 logs papercode-backend   # View backend logs
pm2 logs papercode-frontend  # View frontend logs
pm2 restart all         # Restart all services
pm2 stop all           # Stop all services
pm2 monit              # Real-time monitoring
pm2 startup            # Setup auto-start on reboot
pm2 save               # Save current process list

⚠️ Important Notice

Large Files: The project contains >5GB of data files that are NOT included in the Git repository.

Run the setup script to check data files:

./setup_data.sh

1. Install Dependencies

Backend:

cd backend
pip install -r requirements.txt

Frontend:

cd frontend
npm install

Data:

./setup_data.sh

1.5. Configure LLM Integration (Optional but Recommended)

The system now includes AI-powered features for task classification and GitHub repository matching.

Setup API Keys:

cd backend
cp .env.example .env
# Edit .env with your API keys

Required Environment Variables:

# Choose your preferred LLM provider
PREFERRED_LLM_PROVIDER=gemini  # Options: openai, anthropic, gemini

# Add at least one API key:
OPENAI_API_KEY=your_openai_api_key          # Optional
ANTHROPIC_API_KEY=your_anthropic_api_key    # Optional  
GOOGLE_API_KEY=your_google_api_key          # Recommended (free tier available)

# Optional: GitHub token for enhanced repository search
GITHUB_TOKEN=your_github_personal_token     # Increases rate limits
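As a minimal sketch of how the backend might consume these variables (names match the .env keys above; the actual loading code in full_api_server.py may differ):

```python
import os

def llm_settings():
    # Falls back to "gemini" when PREFERRED_LLM_PROVIDER is unset,
    # mirroring the recommended default above.
    provider = os.getenv("PREFERRED_LLM_PROVIDER", "gemini")
    keys = {
        "openai": os.getenv("OPENAI_API_KEY"),
        "anthropic": os.getenv("ANTHROPIC_API_KEY"),
        "gemini": os.getenv("GOOGLE_API_KEY"),
    }
    if not keys.get(provider):
        raise RuntimeError(f"No API key configured for provider {provider!r}")
    return provider, keys[provider]
```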

Update Database Schema:

cd backend
sqlite3 paperswithcode.db < scripts/add_llm_columns.sql
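If the sqlite3 command-line tool is not installed, the same migration can be applied with Python's standard library (paths taken from the commands above):

```python
import sqlite3

def apply_sql_script(db_path, script_path):
    # Reads the migration file and runs it in one go;
    # executescript() commits implicitly.
    with open(script_path) as f:
        sql = f.read()
    con = sqlite3.connect(db_path)
    try:
        con.executescript(sql)
    finally:
        con.close()

# apply_sql_script("paperswithcode.db", "scripts/add_llm_columns.sql")
```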

Benefits:

  • 🎯 Better Task Classification: AI analyzes paper content rather than relying on simple keyword matching
  • 🔍 Smarter GitHub Discovery: Multi-strategy search with confidence scoring
  • Visual Indicators: UI shows AI-enhanced content with confidence levels
  • 📈 Improved Accuracy: Higher precision in categorization and code linking

2. Start Application

🎯 Recommended: One-Command Startup

Use the startup script to launch everything automatically (includes arXiv paper updates):

./start.sh

This script will:

  • ✅ Check prerequisites (Python, Node.js)
  • ✅ Check and free up required ports (8003, 3000)
  • ✅ Initialize database if needed
  • 🆕 Update latest arXiv papers to data/newPaper.json
  • ✅ Start backend server on port 8003
  • ✅ Start frontend server on port 3000

Manual Startup (Alternative)

Backend Server:

cd backend
python full_api_server.py

Visit http://localhost:8003

Frontend Server:

cd frontend
npm start

Visit http://localhost:3000

3. Test Account

Username: demo
Password: Demo123456
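Once authenticated, protected endpoints expect the JWT in a standard Bearer header. A minimal sketch using only the standard library (the /auth/login path is an assumption; check the server's interactive docs at /docs for the real route):

```python
import json
from urllib.request import Request, urlopen

BASE = "http://localhost:8003/api/v1"

def login(username, password):
    # Hypothetical login route -- verify the actual path in /docs.
    body = json.dumps({"username": username, "password": password}).encode()
    req = Request(f"{BASE}/auth/login", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)["access_token"]

def auth_headers(token):
    # Subsequent requests carry the JWT as a Bearer token.
    return {"Authorization": f"Bearer {token}"}
```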

Project Structure

paperswithcode/
├── backend/
│   ├── full_api_server.py      # FastAPI main server
│   ├── paperswithcode.db        # SQLite database (1.2GB)
│   ├── requirements.txt         # Python dependencies
│   ├── services/               # LLM services
│   │   ├── llm_task_classifier.py    # AI task classification
│   │   └── llm_github_matcher.py     # AI GitHub repository matching
│   └── scripts/                 # Data import and processing scripts
│       ├── daily_arxiv_update.py     # ArXiv paper fetching with LLM
│       └── enhance_existing_papers.py # Batch enhance existing papers
│
├── frontend/
│   ├── src/
│   │   ├── pages/              # Page components
│   │   │   ├── HomePage.tsx    # Home page
│   │   │   ├── PapersPage.tsx  # Papers list
│   │   │   └── TasksPage.tsx   # Tasks list
│   │   ├── components/         # Common components
│   │   └── services/           # API services
│   ├── package.json            # Node dependencies
│   └── vite.config.ts          # Vite configuration
│
└── data/                       # Raw data files
    ├── papers.json.gz          # Paper data
    └── evaluation-tables.json  # Leaderboard data

📰 ArXiv Paper Updates

Automatic Daily Updates

The project includes an arXiv paper fetching system that can automatically update with the latest ML papers.

On Startup (Already Configured)

Every time you run ./start.sh, it automatically fetches the latest papers and saves them to data/newPaper.json.

Set Up Daily Auto-Update (Optional)

Option 1: Using Cron (Recommended)

cd backend/scripts
./setup_daily_cron.sh

This sets up automatic updates every day at 2:00 AM.

Option 2: Using systemd timer

cd backend/scripts
./setup_systemd_timer.sh

Manual Update

To manually fetch the latest papers:

cd backend/scripts
python daily_arxiv_update.py --max-results 50 --days-back 7

Parameters:

  • --max-results: Number of papers to fetch per category (default: 50)
  • --days-back: How many days back to search (default: 1)
  • --config: Custom configuration file (default: arxiv_config.yaml)
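In effect, --days-back defines a submission-date window ending now; a sketch of that computation (the actual script's internals may differ):

```python
from datetime import datetime, timedelta, timezone

def arxiv_date_window(days_back=1):
    # [start, end): only papers submitted within the last `days_back`
    # days are fetched.
    end = datetime.now(timezone.utc)
    return end - timedelta(days=days_back), end
```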

🤖 Enhance Existing Papers with AI

For papers already in your database that lack proper tasks or GitHub links:

cd backend/scripts

# Preview what would be enhanced
python enhance_existing_papers.py --dry-run --max-papers 10

# Actually enhance papers with AI
python enhance_existing_papers.py --max-papers 50 --min-confidence 0.3

This will:

  • 🏷️ Add AI-classified tasks to papers missing specific tasks
  • 🔗 Find GitHub repositories for papers without code links
  • 📊 Provide confidence scores and reasoning for each enhancement
  • ✨ Update UI indicators to show AI-enhanced content

Configuration

Edit backend/scripts/arxiv_config.yaml to customize:

  • Research areas to track
  • Search keywords and filters
  • Number of results per area
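A hypothetical shape for that file (keys are illustrative only; the file shipped in backend/scripts/ is authoritative):

```yaml
research_areas:
  - name: "Computer Vision"
    categories: ["cs.CV"]
    keywords: ["detection", "segmentation"]
    max_results: 50
  - name: "Natural Language Processing"
    categories: ["cs.CL"]
    keywords: ["language model"]
    max_results: 50
```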

The fetched papers include:

  • Title, abstract, authors
  • ArXiv ID and PDF links
  • 🤖 AI-powered task classification with confidence scores
  • 🔗 Intelligent GitHub repository matching using LLM analysis
  • Publication date and venue extraction

API Endpoints

Paper Related

  • GET /api/v1/papers - Get paper list (supports search and pagination)
  • GET /api/v1/papers/{id} - Get paper details
  • GET /api/v1/github/stars - Get GitHub repository star count

Task Related

  • GET /api/v1/tasks - Get task list (supports filtering by area)
  • GET /api/v1/tasks/{id} - Get task details
  • GET /api/v1/areas - Get research areas list

Leaderboard Related

  • GET /api/v1/leaderboards/{id} - Get leaderboard for specific task
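A minimal client sketch for the papers endpoint using only the standard library (the `q` and `page` query-parameter names are assumptions; the FastAPI server exposes interactive docs at /docs where the real names can be checked):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8003/api/v1"

def papers_url(query=None, page=1):
    # Parameter names 'q' and 'page' are guesses -- verify against /docs.
    params = {"page": page}
    if query:
        params["q"] = query
    return f"{BASE}/papers?{urlencode(params)}"

def fetch_papers(**kwargs):
    # Issues the GET request and decodes the JSON response body.
    with urlopen(papers_url(**kwargs)) as resp:
        return json.load(resp)
```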

📊 Data Management

Large Files

The project requires these large files (not in Git):

| File | Size | Description |
|------|------|-------------|
| paperswithcode.db | ~1.3GB | Main database |
| paperswithcode_full.db | ~1.1GB | Full database |
| papers-with-abstracts.json | ~2.2GB | Paper data |
| evaluation-tables.json | ~252MB | Leaderboard data |
| links-between-papers-and-code.json | ~155MB | Code links |

Solutions

  1. Git LFS (Recommended for team collaboration)

    git lfs install
    git lfs track "*.db"
    git lfs track "data/*.json"
  2. Cloud Storage

    • Google Drive / Dropbox
    • AWS S3 / Azure Blob
    • GitHub Releases (max 2GB per file)
  3. Data Download

📈 Statistics

  • Total Papers: 575,626
  • Total Tasks: 3,957
  • Implementations: 218,852
  • Leaderboards: 11,736
  • Research Areas: 16 main categories
    • Computer Vision (1,487 tasks)
    • Natural Language Processing (987 tasks)
    • Reinforcement Learning (456 tasks)
    • Medical (387 tasks)
    • More...

Docker Deployment

docker-compose up --build

The application will start on the same ports as the manual setup: backend on 8003, frontend on 3000.

Requirements

  • Python 3.8+
  • Node.js 16+
  • At least 2GB available disk space (for database)

Optional (for AI features):

  • OpenAI API key (for GPT-based classification)
  • Anthropic API key (for Claude-based classification)
  • Google API key (for Gemini-based classification) - Recommended
  • GitHub Personal Access Token (for enhanced repository search)

🔐 Security Configuration

IMPORTANT: All sensitive variables have been replaced with placeholders. You must configure these before deployment:

Required Environment Variables:

  1. Database Configuration (docker-compose.yml):

    POSTGRES_USER=your_db_username          # Line 7
    POSTGRES_PASSWORD=your_db_password      # Line 8
  2. JWT Secret (docker-compose.yml, ecosystem.config.js):

    SECRET_KEY=your_secret_key              # Must be a secure random string
  3. Server IP Address (Replace in all files):

    • docker-compose.yml line 92: VITE_API_URL=http://your_server_ip:8003
    • frontend/src/config/api.ts line 4: API_BASE_URL=http://your_server_ip:8003
    • frontend/vite.config.ts line 11: allowedHosts=[..., 'your_server_ip']
    • backend/app/core/config.py line 25: CORS_ORIGINS=[..., 'http://your_server_ip', ...]
  4. Google OAuth Configuration (frontend/src/config/api.ts):

    VITE_GOOGLE_CLIENT_ID=your_google_client_id    # Line 16

Optional AI API Keys (.env file in backend/):

# Create backend/.env file with:
OPENAI_API_KEY=your_openai_api_key          # For GPT-based features
ANTHROPIC_API_KEY=your_anthropic_api_key    # For Claude-based features  
GOOGLE_API_KEY=your_google_api_key          # For Gemini-based features
MINIMAX_API_KEY=your_minimax_api_key        # For MiniMax-based features
MINIMAX_GROUP_ID=your_minimax_group_id      # MiniMax group ID
GITHUB_TOKEN=your_github_personal_token     # For enhanced repository search

Local Storage Configuration:

All files (images, papers data) are now stored locally in the data/ directory:

  • Images: data/images/ (automatically created)
  • Papers: data/ directory (arxiv_papers_*.json files)

Security Notes:

  • Never commit real API keys or passwords to version control
  • Use strong, unique passwords for database users
  • Generate a secure JWT secret (minimum 32 characters)
  • Restrict CORS origins to your actual domain in production
  • Use environment variables for all sensitive configuration in production
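For instance, a SECRET_KEY meeting the 32-character minimum can be generated with Python's standard library:

```python
import secrets

# token_urlsafe(48) encodes 48 random bytes as a 64-character
# URL-safe string, comfortably above the 32-character minimum.
print(secrets.token_urlsafe(48))
```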

License

MIT License

Contributing

Issues and Pull Requests are welcome!
