Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance


Qwen Large Model Application Team, Alibaba

In this work, we introduce DIR (Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance), a framework for mitigating format biases (e.g., length, lists, bolding) in reward models (RMs) for large language models. Our approach minimizes the mutual information between the difference of the two response representations and their relative bias attributes. This is achieved by training a variational network adversarially against the RM's encoder, encouraging the encoder to learn representations that are invariant to spurious format correlations while retaining the true preference signal.
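
To make the objective concrete, the following minimal PyTorch-style sketch illustrates the idea. It is not the actual implementation in reward_models/debias_trainer.py; names such as RewardModel, BiasPredictor, rel_bias, and the beta weight are hypothetical. The variational network is trained to predict the relative bias attribute from the difference of the two response representations, while the RM is trained on the usual Bradley-Terry preference loss plus a penalty that makes this prediction hard:

# Minimal, illustrative sketch of the DIR-style objective (not the actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    # Toy stand-in for an LLM-based RM: maps a pooled response embedding to (reward, representation).
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.encoder(x)                      # response representation
        return self.head(h).squeeze(-1), h       # scalar reward, representation

class BiasPredictor(nn.Module):
    # Variational network q(b | delta_h): predicts the relative bias attribute
    # from the difference of the two response representations.
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, delta_h):
        return self.net(delta_h).squeeze(-1)     # logit for, e.g., "the chosen response is longer"

rm, q = RewardModel(), BiasPredictor()
opt_rm = torch.optim.AdamW(rm.parameters(), lr=1e-5)
opt_q = torch.optim.AdamW(q.parameters(), lr=1e-4)
beta = 0.1                                       # weight of the MI penalty (hypothetical value)

def train_step(x_chosen, x_rejected, rel_bias):
    # rel_bias: 0/1 label of the relative format attribute (e.g., which response is longer).
    # 1) Update q so it predicts the bias attribute from delta_h as well as possible
    #    (this tightens the variational estimate of the mutual information).
    with torch.no_grad():
        _, h_c = rm(x_chosen)
        _, h_r = rm(x_rejected)
    q_loss = F.binary_cross_entropy_with_logits(q(h_c - h_r), rel_bias)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # 2) Update the RM with the Bradley-Terry preference loss plus an adversarial MI penalty,
    #    which pushes delta_h to carry no information about the bias attribute.
    r_c, h_c = rm(x_chosen)
    r_r, h_r = rm(x_rejected)
    pref_loss = -F.logsigmoid(r_c - r_r).mean()
    mi_proxy = -F.binary_cross_entropy_with_logits(q(h_c - h_r), rel_bias)   # high when delta_h predicts the bias
    loss = pref_loss + beta * mi_proxy
    opt_rm.zero_grad(); loss.backward(); opt_rm.step()
    return pref_loss.item(), q_loss.item()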

⚙️ 1. Setup and Installation

First, we recommend creating a Conda virtual environment and installing the required dependencies.

# Create and activate the conda environment
conda create -n dir python=3.9
conda activate dir

# Install the required dependencies
pip install -r requirements.txt

📥 2. Data and Model Preparation

We provide convenient scripts to download all the necessary datasets (e.g., Skywork-preference-70K-v0.2, bias evaluation sets) and the base models (e.g., Llama-3-8B-Instruct) used in our experiments.

Run the following commands from the project's root directory:

# Navigate to the scripts directory
cd scripts

# Download all required datasets
bash auto_download_data.sh

# Download the base language models
bash auto_download_model.sh

After the scripts complete, your data and models will be organized in the designated directories.
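
As a quick sanity check after downloading, the preference data can be inspected with the Hugging Face datasets library. This is only a sketch: the local path and file format below are hypothetical and depend on where auto_download_data.sh places the files on your machine.

# Illustrative only: the path and file name are hypothetical; adjust them to
# wherever auto_download_data.sh stores the preference data on your machine.
from datasets import load_dataset

ds = load_dataset("json", data_files="../data/skywork_preference/train.jsonl", split="train")
print(ds.column_names)   # typically prompt / chosen / rejected style fields
print(ds[0])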

🚀 3. Training and Evaluation Pipeline

The full experimental pipeline consists of three main stages: training the debiased reward model, aligning a policy model using PPO, and evaluating the final policy.

Step 3.1: Train the Debiased Reward Model (DIR)

To train our debiased reward model using the DIR framework, run the train_debias_rm.sh script. This script orchestrates the training process defined in reward_models/run_debias_reward_models_train.py.

# Make sure you are in the scripts/ directory
bash train_debias_rm.sh

The training logs and final RM checkpoints will be saved to the output directory specified within the script (e.g., ../exp/debiased_rm).
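
After training, you can sanity-check the resulting RM by scoring a couple of responses. The sketch below assumes the checkpoint was saved in the standard Hugging Face sequence-classification format with a single reward head; the example prompt and responses are arbitrary.

# Illustrative only: assumes the debiased RM is a standard Hugging Face
# sequence-classification checkpoint with a single reward head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "../exp/debiased_rm/checkpoint-final"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

prompt = "Explain why the sky is blue."
responses = [
    "Because shorter (blue) wavelengths of sunlight are scattered more strongly by air molecules.",
    "**Answer:**\n- The sky\n- is blue\n- because of scattering.",
]

with torch.no_grad():
    for resp in responses:
        inputs = tokenizer(prompt + "\n" + resp, return_tensors="pt", truncation=True)
        reward = model(**inputs).logits.squeeze().item()
        print(f"reward = {reward:.3f}")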

Step 3.2: Align a Policy with PPO

Once the debiased RM is trained, we use it to provide rewards for aligning a policy model with Proximal Policy Optimization (PPO), using the ms-swift training framework.

Important: Before running, you must edit ms_ppo_script.sh and update the REWARD_MODEL_PATH variable to point to the checkpoint of the debiased RM trained in the previous step. Please also make sure you have cloned MS-Swift successfully.

# Example modification inside ms_ppo_script.sh:
# REWARD_MODEL_PATH="../exp/debiased_rm/checkpoint-final"

# Run the PPO training script
bash ms_ppo_script.sh

This will train a policy model and save the checkpoints to the specified output directory.

Step 3.3: Evaluate the Final Aligned Policy

Finally, we evaluate the performance of the PPO-aligned policy model on various benchmarks using the evalscope evaluation tool.

Important: Before running, you must edit rm_eval/evalscope_evaluation_script.sh and update the MODEL_PATH variable to point to the PPO-aligned model checkpoint from Step 3.2. Please also make sure you have cloned EvalScope successfully.

# From the root directory, run the evaluation script
bash rm_eval/evalscope_evaluation_script.sh

The script will generate responses for the benchmark prompts and compute the final evaluation scores, saving the results to the specified output directory.
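
Before launching the full benchmark run, it can help to spot-check the aligned policy with a single generation. The checkpoint path below is hypothetical and should point to the output directory of Step 3.2.

# Illustrative spot check of the PPO-aligned policy; the checkpoint path is hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "../exp/ppo_policy/checkpoint-final"   # replace with your Step 3.2 output
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto").eval()

messages = [{"role": "user", "content": "Summarize the benefits of unit testing in two sentences."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))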

📁 Repository Structure

.
├── deepspeed_configs/     # DeepSpeed configuration files
├── reward_models/         # Core logic for training all reward models
│   ├── debias_trainer.py  # Trainer implementing the DIR framework
│   └── run_debias_reward_models_train.py # Main script to launch DIR training
├── rm_eval/               # Scripts for evaluating reward and policy models
│   ├── eval_biasbench.py  # Evaluate format bias
│   └── evalscope_evaluation_script.sh # Evaluate policy performance
├── scripts/               # Main workflow orchestration scripts
│   ├── auto_download_data.sh
│   ├── auto_download_model.sh
│   ├── train_debias_rm.sh # Use this to train our model
│   └── ms_ppo_script.sh   # Use this for PPO alignment
├── requirements.txt       # Project dependencies
└── README.md              # This file

🙏 Acknowledgements

This project is built upon several fantastic open-source libraries. We would like to extend our heartfelt gratitude to the developers and communities of MS-Swift, EvalScope, and DeepSpeed.

📜 Citation

If you find our work useful in your research, please consider citing our paper:

@misc{li2025eliminatinginductivebiasreward,
      title={Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance}, 
      author={Zhuo Li and Pengyu Cheng and Zhechao Yu and Feifei Tong and Anningzhe Gao and Tsung-Hui Chang and Xiang Wan and Erchao Zhao and Xiaoxi Jiang and Guanjun Jiang},
      year={2025},
      eprint={2512.23461},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.23461}, 
}
