Sotopia-RL: Reward Design for Social Intelligence

This repository contains the Sotopia-RL model, which is a fine-tuned Large Language Model (LLM) designed for enhanced social intelligence.

Model Details

Model Description

Sotopia-RL is a novel framework that refines coarse episode-level feedback from social interactions into utterance-level, multi-dimensional rewards. This approach makes reinforcement learning (RL) more efficient and stable for training socially intelligent agents by addressing challenges like partial observability (where utterances have indirect and delayed effects) and multi-dimensionality (where behaviors like rapport-building or knowledge-seeking contribute indirectly to goal achievement). Experiments show that Sotopia-RL achieves state-of-the-art social goal completion scores in the Sotopia environment.

  • Developed by: The authors of the Sotopia-RL paper
  • Model type: Fine-tuned Large Language Model (LLM) for social intelligence, implemented as a LoRA adapter.
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: Qwen/Qwen2.5-7B-Instruct

Model Sources

  • Paper: arXiv:2508.03905 (Sotopia-RL: Reward Design for Social Intelligence)
  • Model repository: ulab-ai/sotopia-rl-qwen-2.5-7B-grpo on the Hugging Face Hub
  • Code: the project's Sotopia-RL GitHub repository

Uses

Direct Use

Sotopia-RL is intended for research and development in training socially intelligent agents, particularly in interactive environments where LLMs need to learn complex social strategies directly through reinforcement learning. It can be used for tasks requiring nuanced social interaction, dialogue generation, and goal-oriented conversations in simulated settings.

Downstream Use

This model can be integrated into larger AI systems for applications such as:

  • Developing conversational AI with improved social awareness and response generation.
  • Creating intelligent agents for simulations, games, or educational tools that require social reasoning.
  • Research into reinforcement learning from human feedback or sophisticated reward design for complex, multi-faceted tasks.

Out-of-Scope Use

This model is not intended for:

  • Deployment in critical real-world scenarios without further rigorous safety and ethical evaluations.
  • Generating harmful, biased, or unethical content. Users are responsible for ensuring ethical deployment.
  • General text generation tasks without explicit social interaction context, as its fine-tuning is specialized.

Bias, Risks, and Limitations

The model's performance and behavior are influenced by its training data and the underlying base model. While the multi-dimensional reward design aims to mitigate reward hacking and promote more aligned behaviors, potential risks such as unintended biases, generation of misleading information, or emergent undesirable behaviors still exist. Thorough evaluation and human oversight are recommended for any real-world applications.

Recommendations

Users should be aware of the risks, biases and limitations of the model. It is recommended to:

  • Conduct thorough testing on specific use cases and datasets before deployment.
  • Implement human-in-the-loop validation for critical applications.
  • Monitor model outputs for unexpected or harmful behaviors.
  • Adhere to ethical AI guidelines and responsible development practices.

How to Get Started with the Model

You can load the base model and the PEFT adapter using the transformers and peft libraries and then merge them for inference.

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel, PeftConfig
import torch

# Load the PEFT configuration
peft_model_id = "ulab-ai/sotopia-rl-qwen-2.5-7B-grpo"  # LoRA adapter repository for this model
peft_config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model
model = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path, torch_dtype=torch.bfloat16, device_map="auto")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)

# Load the PEFT model
model = PeftModel.from_pretrained(model, peft_model_id)

# Merge LoRA layers with base model and unload LoRA
model = model.merge_and_unload()

# Example inference
messages = [
    {"role": "user", "content": "Hello, how are you today? What's your purpose?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # passes input_ids and attention_mask
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)

decoded_output = tokenizer.batch_decode(generated_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(decoded_output)
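
If you plan to reuse the merged weights, you can save them once with the standard transformers calls and reload them later as a standalone model (the output directory name below is only an example):

# Optionally save the merged weights so they can be reloaded later without peft.
# The output directory name is only an example.
model.save_pretrained("sotopia-rl-merged")
tokenizer.save_pretrained("sotopia-rl-merged")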

Training Details

Training Data

The model was trained and evaluated within Sotopia, an open-ended social learning environment. The framework refines coarse episode-level feedback into utterance-level, multi-dimensional rewards, which form the training signals. Details of the datasets used for fine-tuning are available in the paper.
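
For illustration only, an utterance-level, multi-dimensional annotation could be represented as a record like the one below. The field names and values are hypothetical and do not reflect the actual dataset schema; see the paper and codebase for the real format.

# Hypothetical example of an utterance-level, multi-dimensional reward annotation.
# Field names and values are illustrative only; the actual schema is defined in
# the Sotopia-RL codebase.
example_record = {
    "episode_id": "ep-001",
    "turn": 3,
    "utterance": "I understand the deadline is tight; could we split the remaining tasks?",
    "rewards": {
        "goal_completion": 0.8,   # progress toward the agent's social goal
        "relationship": 0.6,      # rapport-building
        "knowledge": 0.3,         # knowledge-seeking
    },
}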

Training Procedure

Sotopia-RL employs a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards for reinforcement learning (RL) training. This approach specifically addresses partial observability and multi-dimensionality in social interactions, allowing models to learn sophisticated strategies directly through social interactions.
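
For intuition, the sketch below shows one way per-utterance scores along several social dimensions could be collapsed into a single scalar reward for RL. The dimension names and weights are illustrative assumptions, not the exact formulation from the paper.

from typing import Dict

# Illustrative dimension weights; the actual reward design is described in the paper.
DIMENSION_WEIGHTS: Dict[str, float] = {
    "goal_completion": 0.6,
    "relationship": 0.25,
    "knowledge": 0.15,
}

def utterance_reward(scores: Dict[str, float]) -> float:
    """Weighted sum of per-dimension scores for one utterance."""
    return sum(w * scores.get(dim, 0.0) for dim, w in DIMENSION_WEIGHTS.items())

print(utterance_reward({"goal_completion": 0.9, "relationship": 0.4, "knowledge": 0.2}))  # ≈ 0.67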

Training Hyperparameters

Specific hyperparameters are detailed in the accompanying paper and codebase.

Training regime

  • Reinforcement learning (RL) with utterance-level, multi-dimensional rewards (a minimal sketch of a GRPO-style advantage computation follows below).
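
The released checkpoint name (sotopia-rl-qwen-2.5-7B-grpo) suggests GRPO-style policy optimization. Under that assumption, the sketch below illustrates group-relative advantage normalization over responses sampled for the same social context; it is a minimal illustration, not the project's actual training loop.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage: normalize rewards within a group of responses
    sampled for the same context (reward minus group mean, divided by group std)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Scalar rewards for four candidate replies to the same social context (illustrative values).
rewards = torch.tensor([0.67, 0.35, 0.80, 0.20])
print(group_relative_advantages(rewards))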

Evaluation

Testing Data, Factors & Metrics

Evaluations were conducted on existing social intelligence benchmarks within the Sotopia environment.

Testing Data

Experiments were performed in Sotopia, an open-ended social learning environment, and evaluated on the Sotopia-hard and Sotopia-full benchmarks.

Metrics

"Social goal completion scores" were used as a primary metric.

Results

Sotopia-RL achieved significant improvements, demonstrating state-of-the-art social goal completion scores:

  • Sotopia-hard: 7.17
  • Sotopia-full: 8.31

These scores significantly outperform existing approaches. Ablation studies confirmed the necessity of both utterance-level credit assignment and multi-dimensional reward design for effective RL training.

Citation

If you find our work helpful or inspiring, please feel free to cite it.

@article{ji2025sotopia,
  title={Sotopia-RL: Reward Design for Social Intelligence},
  author={Ji, Zhoujun and Zhang, Wenqi and Chen, Yutong and Yang, Min and Li, Jiateng and Li, Chen and He, Shuyan and Wei, Yiran and Wang, Tianyi and Xu, Shuyin and Lin, Min and Tian, Tian},
  journal={arXiv preprint arXiv:2508.03905},
  year={2025}
}

Framework versions

  • PEFT 0.13.2
  • Transformers (compatible with Qwen2.5-7B-Instruct)

Model Card Contact

For questions regarding the model, please refer to the project's GitHub repository or contact the authors listed in the paper.
