---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-Math-7B-Instruct
pipeline_tag: text-classification
library_name: transformers
---

# PathFinder-PRM-7B

<div align="center">
<img src="images/PathFinder.png" width="300">
</div>

## Introduction

PathFinder-PRM-7B is a hierarchical, discriminative Process Reward Model (PRM) designed to identify errors and reward correct mathematical reasoning in multi-step outputs from large language models (LLMs). Instead of treating evaluation as a single correct-or-wrong decision, PathFinder-PRM-7B breaks its error judgment into two parts: whether the current step is mathematically correct, and whether it is logically consistent. It predicts these two aspects separately and then combines them to decide whether the current reasoning step leads to a correct final solution. PathFinder-PRM-7B is trained on a combination of high-quality human-annotated data (PRM800K) and additional automatically annotated samples, enabling robustness to common failure patterns and strong generalization across diverse benchmarks such as ProcessBench and PRMBench.

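The step reward is thus hierarchical: the two error sub-labels gate the final correctness score. A minimal sketch of this decision rule in plain Python (illustrative only; the actual inference procedure is shown under Usage below):

```python
def step_reward(math_ok: bool, consistent: bool, p_correct: float) -> float:
    """Combine the two sub-judgments into a single step-level reward."""
    # If either sub-dimension is judged negative, the step contains an error.
    if not (math_ok and consistent):
        return -1.0
    # Otherwise the reward is the estimated probability that this step
    # leads to a correct final solution.
    return p_correct
```
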
## Model Details

### Model Description

- **Model type:** Process Reward Model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** Qwen/Qwen2.5-Math-7B-Instruct

### Model Sources

- **Repository:** https://github.com/declare-lab/PathFinder-PRM/tree/main

For more details, please refer to our paper and GitHub repository.

## Usage

### 🤗 Hugging Face Transformers

The following snippet shows how to run inference with PathFinder-PRM-7B using `transformers`:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "declare-lab/PathFinder-PRM-7B"
device = "auto"

PROMPT_PREFIX = "You are a Math Teacher. Given a question and a student's solution, evaluate the mathemetical correctness, logic consistency of the current step and whether it will lead to the correct final solution"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModel.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
).eval()

# Special tokens for positive/negative judgments
pos_token_id = tokenizer.encode("<+>")[0]
neg_token_id = tokenizer.encode("<->")[0]


def run_inference(sample_input):
    message_ids = tokenizer.apply_chat_template(
        sample_input,
        tokenize=True,
        return_dict=True,
        return_tensors='pt'
    ).to(model.device)

    mask_token_id = tokenizer.encode("<extra>")[0]
    token_masks = (message_ids['input_ids'] == mask_token_id)

    # The prediction for a masked token is read from the logits of the
    # position immediately before it, so shift the mask left by one.
    shifted_mask = torch.cat(
        [
            token_masks[:, 1:],
            torch.zeros(token_masks.size(0), 1, dtype=torch.bool, device=model.device)
        ],
        dim=1
    )

    # 1st forward pass: predict math correctness and logical consistency
    with torch.no_grad():
        outputs = model(**message_ids)

    allowed_token_ids = torch.tensor([pos_token_id, neg_token_id], device=outputs.logits.device)

    masked_logits = outputs.logits[shifted_mask][:, allowed_token_ids]
    predicted_indices = masked_logits.argmax(dim=-1)
    predicted_tokens = allowed_token_ids[predicted_indices]

    decoded_tokens = [tokenizer.decode([int(token_id)], skip_special_tokens=False) for token_id in predicted_tokens]

    if '<->' in decoded_tokens:
        # error found in step
        return -1

    # Prepare the input for the 2nd forward pass
    new_messages = sample_input.copy()

    asst_response = new_messages[-1]['content']

    # Replace the mask tokens with the predicted math and consistency labels
    for pred in decoded_tokens:
        asst_response = asst_response.replace("<extra>", pred, 1)

    asst_response += ', Correctness: <extra>'

    new_messages[-1]['content'] = asst_response

    new_message_ids = tokenizer.apply_chat_template(
        new_messages,
        tokenize=True,
        return_dict=True,
        return_tensors='pt'
    ).to(model.device)

    token_masks = (new_message_ids['input_ids'] == mask_token_id)

    shifted_mask = torch.cat(
        [
            token_masks[:, 1:],
            torch.zeros(token_masks.size(0), 1, dtype=torch.bool, device=model.device)
        ],
        dim=1
    )

    # 2nd forward pass: predict whether the step leads to a correct final solution
    with torch.no_grad():
        outputs = model(**new_message_ids)

    masked_logits = outputs.logits[shifted_mask]

    restricted_logits = masked_logits[:, [pos_token_id, neg_token_id]]

    probs_pos_neg = F.softmax(restricted_logits, dim=-1)

    # Probability of the positive label is the step reward
    return probs_pos_neg[0][0].cpu().item()


question = "Sue lives in a fun neighborhood. One weekend, the neighbors decided to play a prank on Sue. On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard. On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard. Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?"

prev_steps = ["To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.",
              "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, (1/3 \\times 18 = 6) flamingos are taken back. So, they have (18 - 6 = 12) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has (12 + 6 = 18) pink flamingos and 6 white flamingos.",
              "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has (18 + 18 = 36) pink flamingos and still 6 white flamingos."]

curr_step = "To find the difference, subtract the number of white flamingos from the number of pink flamingos: (36 - 6 = 30). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is (\\boxed{30})."

prev_steps_str = "\n\n".join(prev_steps)

messages = [
    {"role": "user", "content": PROMPT_PREFIX + "\n\n Question: " + question},
    {"role": "assistant", "content": prev_steps_str + "\n\nCurrent Step: " + curr_step + " Math reasoning: <extra>, Consistency: <extra>"},
]

reward_score = run_inference(messages)
```
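
`run_inference` returns `-1` when an error is detected in the current step, and otherwise the probability (in `[0, 1]`) that the step leads to a correct final solution. Below is a minimal sketch of how one might score every step of a solution with this function; the `score_solution` helper is illustrative, not part of this repository:

```python
def score_solution(question: str, steps: list[str]) -> list[float]:
    """Score each step of a solution with PathFinder-PRM-7B."""
    rewards = []
    for i, step in enumerate(steps):
        messages = [
            {"role": "user", "content": PROMPT_PREFIX + "\n\n Question: " + question},
            {"role": "assistant", "content": "\n\n".join(steps[:i]) + "\n\nCurrent Step: " + step
                + " Math reasoning: <extra>, Consistency: <extra>"},
        ]
        rewards.append(run_inference(messages))
    return rewards

# Example: step_rewards = score_solution(question, prev_steps + [curr_step])
```
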
## Evaluation

### Evaluation Benchmarks

- [**ProcessBench**](https://huggingface.co/datasets/Qwen/ProcessBench)
- [**PRMBench**](https://github.com/ssmisya/PRMBench)
- [**Reward-Guided Greedy Search**](https://github.com/NJUNLP/R-PRM/tree/main/src/datasets) (see the sketch below) on:
  - [MATH500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500)
  - [AIME24](https://huggingface.co/datasets/math-ai/aime24)
  - [AMC23](https://huggingface.co/datasets/math-ai/amc23)
  - [Minerva Math](https://huggingface.co/datasets/math-ai/minervamath)
  - [Olympiad Bench](https://huggingface.co/datasets/Hothan/OlympiadBench)
  - [College Math](https://huggingface.co/datasets/realtreetune/college_math)

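In reward-guided greedy search, a policy model proposes several candidate next steps and the PRM keeps the highest-scoring candidate at each round. A minimal sketch of that loop, reusing `run_inference` from the Usage section (the `generate_candidates` helper and its signature are illustrative assumptions, not part of this repository):

```python
def reward_guided_greedy_search(question, generate_candidates, n_candidates=8, max_steps=20):
    """Greedily build a solution, keeping the PRM-preferred step each round."""
    steps = []
    for _ in range(max_steps):
        # generate_candidates(question, steps, n) -> list of n candidate next steps
        candidates = generate_candidates(question, steps, n_candidates)
        scored = []
        for cand in candidates:
            messages = [
                {"role": "user", "content": PROMPT_PREFIX + "\n\n Question: " + question},
                {"role": "assistant", "content": "\n\n".join(steps) + "\n\nCurrent Step: " + cand
                    + " Math reasoning: <extra>, Consistency: <extra>"},
            ]
            scored.append((run_inference(messages), cand))
        best_score, best_step = max(scored, key=lambda pair: pair[0])
        steps.append(best_step)
        if "\\boxed" in best_step:  # stop once a final answer has been produced
            break
    return steps
```
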
### Results

[More Information Needed]

## Citation

```bibtex
@misc{pala2025errortypingsmarterrewards,
      title={Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision},
      author={Tej Deep Pala and Panshul Sharma and Amir Zadeh and Chuan Li and Soujanya Poria},
      year={2025},
      eprint={2505.12345},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.12345},
}
```