Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance


Qwen Large Model Application Team, Alibaba

In this work, we introduce DIR (Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance), a framework for mitigating format biases (e.g., length, lists, bolding) in reward models (RMs) for large language models. Our approach minimizes the mutual information between the difference of the two response representations and their relative bias attributes. This is achieved by training a variational network adversarially against the RM's encoder, encouraging the encoder to learn representations that are invariant to spurious format correlations while retaining the true preference signal.
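
To make the objective concrete, the following minimal PyTorch-style sketch illustrates the idea. It is not the actual implementation in reward_models/debias_trainer.py; names such as RewardModel, BiasPredictor, rel_bias, and the beta weight are hypothetical. The variational network is trained to predict the relative bias attribute from the difference of the two response representations, while the RM is trained on the usual Bradley-Terry preference loss plus a penalty that makes this prediction hard:

# Minimal, illustrative sketch of the DIR-style objective (not the actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    # Toy stand-in for an LLM-based RM: maps a pooled response embedding to (reward, representation).
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.encoder(x)                      # response representation
        return self.head(h).squeeze(-1), h       # scalar reward, representation

class BiasPredictor(nn.Module):
    # Variational network q(b | delta_h): predicts the relative bias attribute
    # from the difference of the two response representations.
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, delta_h):
        return self.net(delta_h).squeeze(-1)     # logit for, e.g., "the chosen response is longer"

rm, q = RewardModel(), BiasPredictor()
opt_rm = torch.optim.AdamW(rm.parameters(), lr=1e-5)
opt_q = torch.optim.AdamW(q.parameters(), lr=1e-4)
beta = 0.1                                       # weight of the MI penalty (hypothetical value)

def train_step(x_chosen, x_rejected, rel_bias):
    # rel_bias: 0/1 label of the relative format attribute (e.g., which response is longer).
    # 1) Update q so it predicts the bias attribute from delta_h as well as possible
    #    (this tightens the variational estimate of the mutual information).
    with torch.no_grad():
        _, h_c = rm(x_chosen)
        _, h_r = rm(x_rejected)
    q_loss = F.binary_cross_entropy_with_logits(q(h_c - h_r), rel_bias)
    opt_q.zero_grad(); q_loss.backward(); opt_q.step()

    # 2) Update the RM with the Bradley-Terry preference loss plus an adversarial MI penalty,
    #    which pushes delta_h to carry no information about the bias attribute.
    r_c, h_c = rm(x_chosen)
    r_r, h_r = rm(x_rejected)
    pref_loss = -F.logsigmoid(r_c - r_r).mean()
    mi_proxy = -F.binary_cross_entropy_with_logits(q(h_c - h_r), rel_bias)   # high when delta_h predicts the bias
    loss = pref_loss + beta * mi_proxy
    opt_rm.zero_grad(); loss.backward(); opt_rm.step()
    return pref_loss.item(), q_loss.item()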

⚙️ 1. Setup and Installation

First, we recommend creating a Conda virtual environment and installing the required dependencies.

# Create and activate the conda environment
conda create -n dir python=3.9
conda activate dir

# Install the required dependencies
pip install -r requirements.txt

📥 2. Data and Model Preparation

We provide convenient scripts to download all the necessary datasets (e.g., Skywork-preference-70K-v0.2, bias evaluation sets) and the base models (e.g., Llama-3-8B-Instruct) used in our experiments.

Run the following commands from the project's root directory:

# Navigate to the scripts directory
cd scripts

# Download all required datasets
bash auto_download_data.sh

# Download the base language models
bash auto_download_model.sh

After the scripts complete, your data and models will be organized in the designated directories.
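
As a quick sanity check after downloading, the preference data can be inspected with the Hugging Face datasets library. This is only a sketch: the local path and file format below are hypothetical and depend on where auto_download_data.sh places the files on your machine.

# Illustrative only: the path and file name are hypothetical; adjust them to
# wherever auto_download_data.sh stores the preference data on your machine.
from datasets import load_dataset

ds = load_dataset("json", data_files="../data/skywork_preference/train.jsonl", split="train")
print(ds.column_names)   # typically prompt / chosen / rejected style fields
print(ds[0])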

🚀 3. Training and Evaluation Pipeline

The full experimental pipeline consists of three main stages: training the debiased reward model, aligning a policy model using PPO, and evaluating the final policy.

Step 3.1: Train the Debiased Reward Model (DIR)

To train our debiased reward model using the DIR framework, run the train_debias_rm.sh script. This script orchestrates the training process defined in reward_models/run_debias_reward_models_train.py.

# Make sure you are in the scripts/ directory
bash train_debias_rm.sh

The training logs and final RM checkpoints will be saved to the output directory specified within the script (e.g., ../exp/debiased_rm).
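
After training, you can sanity-check the resulting RM by scoring a couple of responses. The sketch below assumes the checkpoint was saved in the standard Hugging Face sequence-classification format with a single reward head; the example prompt and responses are arbitrary.

# Illustrative only: assumes the debiased RM is a standard Hugging Face
# sequence-classification checkpoint with a single reward head.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "../exp/debiased_rm/checkpoint-final"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

prompt = "Explain why the sky is blue."
responses = [
    "Because shorter (blue) wavelengths of sunlight are scattered more strongly by air molecules.",
    "**Answer:**\n- The sky\n- is blue\n- because of scattering.",
]

with torch.no_grad():
    for resp in responses:
        inputs = tokenizer(prompt + "\n" + resp, return_tensors="pt", truncation=True)
        reward = model(**inputs).logits.squeeze().item()
        print(f"reward = {reward:.3f}")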

Step 3.2: Align a Policy with PPO

Once the debiased RM is trained, we use it to provide rewards for aligning a policy model with Proximal Policy Optimization (PPO), using the ms-swift training framework.

Important: Before running, you must edit ms_ppo_script.sh and update the REWARD_MODEL_PATH variable to point to the checkpoint of the debiased RM trained in the previous step. Please also make sure you have cloned MS-Swift successfully.

# Example modification inside ms_ppo_script.sh:
# REWARD_MODEL_PATH="../exp/debiased_rm/checkpoint-final"

# Run the PPO training script
bash ms_ppo_script.sh

This will train a policy model and save the checkpoints to the specified output directory.

Step 3.3: Evaluate the Final Aligned Policy

Finally, we evaluate the performance of the PPO-aligned policy model on various benchmarks using the evalscope evaluation tool.

Important: Before running, you must edit rm_eval/evalscope_evaluation_script.sh and update the MODEL_PATH variable to point to the PPO-aligned model checkpoint from Step 3.2. Please also make sure you have cloned EvalScope successfully.

# From the root directory, run the evaluation script
bash rm_eval/evalscope_evaluation_script.sh

The script will generate responses for the benchmark prompts and compute the final evaluation scores, saving the results to the specified output directory.
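
Before launching the full benchmark run, it can help to spot-check the aligned policy with a single generation. The checkpoint path below is hypothetical and should point to the output directory of Step 3.2.

# Illustrative spot check of the PPO-aligned policy; the checkpoint path is hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "../exp/ppo_policy/checkpoint-final"   # replace with your Step 3.2 output
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto").eval()

messages = [{"role": "user", "content": "Summarize the benefits of unit testing in two sentences."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))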

📁 Repository Structure

.
├── deepspeed_configs/     # DeepSpeed configuration files
├── reward_models/         # Core logic for training all reward models
│   ├── debias_trainer.py  # Trainer implementing the DIR framework
│   └── run_debias_reward_models_train.py # Main script to launch DIR training
├── rm_eval/               # Scripts for evaluating reward and policy models
│   ├── eval_biasbench.py  # Evaluate format bias
│   └── evalscope_evaluation_script.sh # Evaluate policy performance
├── scripts/               # Main workflow orchestration scripts
│   ├── auto_download_data.sh
│   ├── auto_download_model.sh
│   ├── train_debias_rm.sh # Use this to train our model
│   └── ms_ppo_script.sh   # Use this for PPO alignment
├── requirements.txt       # Project dependencies
└── README.md              # This file

🙏 Acknowledgements

This project is built upon several fantastic open-source libraries. We would like to extend our heartfelt gratitude to the developers and communities of MS-Swift, EvalScope, and DeepSpeed.

📜 Citation

If you find our work useful in your research, please consider citing our paper:

@misc{li2025eliminatinginductivebiasreward,
      title={Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance}, 
      author={Zhuo Li and Pengyu Cheng and Zhechao Yu and Feifei Tong and Anningzhe Gao and Tsung-Hui Chang and Xiang Wan and Erchao Zhao and Xiaoxi Jiang and Guanjun Jiang},
      year={2025},
      eprint={2512.23461},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.23461}, 
}
