zephyr-7b-dpo-full-alpha_0.5_batch32
This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:
- Loss: 0.7742
- Rewards/chosen: -1.3339
- Rewards/rejected: -2.3724
- Rewards/accuracies: 0.7738
- Rewards/margins: 1.0385
- Logps/rejected: -497.4416
- Logps/chosen: -415.3672
- Logits/rejected: 1.0737
- Logits/chosen: -0.2206
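
For context (not stated in the original card): in DPO, the reward columns are the implicit rewards $r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ computed against the frozen SFT reference model, and Rewards/margins is simply Rewards/chosen minus Rewards/rejected; the $\beta$ used for this run is not reported. The evaluation numbers above satisfy that identity:

$$
\text{margins} = r(x, y_{\text{chosen}}) - r(x, y_{\text{rejected}}) = -1.3339 - (-2.3724) = 1.0385
$$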
Model description
More information needed
Intended uses & limitations
More information needed
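
Until the authors document intended usage, the sketch below shows one plausible way to query the model. It assumes the tokenizer keeps the Zephyr chat template inherited from alignment-handbook/zephyr-7b-sft-full and uses the standard transformers text-generation pipeline; the prompt and generation settings are illustrative, not the authors' recommendations.

```python
# Unofficial inference sketch: chat generation via the transformers pipeline.
# Assumes the tokenizer ships the Zephyr chat template from the SFT base model.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="YeongminKim/zephyr-7b-dpo-full-alpha_0.5_batch32",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain DPO in one paragraph."},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
out = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
# The pipeline returns the prompt plus the completion; strip the prompt for display.
print(out[0]["generated_text"][len(prompt):])
```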
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-07
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- total_train_batch_size: 32
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
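
These values match a typical alignment-handbook DPO recipe (per-device batch size 8 on 4 GPUs for an effective batch size of 32, with no gradient accumulation). As an unofficial illustration, the sketch below wires the listed hyperparameters into trl's DPOTrainer; the DPO beta, sequence lengths, precision, and the meaning of "alpha_0.5" in the model name are not stated in this card, so those parts are assumptions.

```python
# Unofficial reconstruction of the training setup from the hyperparameters above.
# Only the listed values come from the card; beta, bf16, and the dataset splits
# are assumptions, and "alpha_0.5" from the model name is not modelled here.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "alignment-handbook/zephyr-7b-sft-full"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Depending on the trl version, the conversational "chosen"/"rejected" columns may
# need to be flattened to plain text before training.
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized")

args = DPOConfig(
    output_dir="zephyr-7b-dpo-full-alpha_0.5_batch32",
    learning_rate=5e-7,
    per_device_train_batch_size=8,   # 4 GPUs -> effective batch size 32
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    seed=42,
    bf16=True,   # assumption: precision is not stated in the card
    beta=0.1,    # assumption: the DPO beta is not reported
)

# With ref_model left unset, DPOTrainer uses a frozen copy of the model as reference.
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=ds["train_prefs"],
    eval_dataset=ds["test_prefs"],
    tokenizer=tokenizer,
)
trainer.train()
```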
Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0.9799 | 0.0523 | 100 | 0.9798 | 0.0239 | -0.0223 | 0.7004 | 0.0462 | -262.4359 | -279.5893 | -2.5499 | -2.6053 |
| 0.9062 | 0.1047 | 200 | 0.8992 | -0.8789 | -1.2476 | 0.7044 | 0.3687 | -384.9631 | -369.8690 | -1.8739 | -2.0394 |
| 0.8625 | 0.1570 | 300 | 0.9493 | -0.8080 | -1.4211 | 0.7401 | 0.6130 | -402.3089 | -362.7793 | -0.4321 | -0.7556 |
| 0.8393 | 0.2093 | 400 | 0.8450 | -1.2171 | -1.8293 | 0.7381 | 0.6122 | -443.1324 | -403.6841 | 1.2281 | 0.4949 |
| 0.7936 | 0.2616 | 500 | 0.8415 | -1.5714 | -2.3906 | 0.7381 | 0.8192 | -499.2642 | -439.1208 | 1.2276 | 0.4250 |
| 0.8417 | 0.3140 | 600 | 0.8105 | -1.0803 | -1.9159 | 0.7599 | 0.8356 | -451.7932 | -390.0083 | 1.2799 | 0.5051 |
| 0.7909 | 0.3663 | 700 | 0.8043 | -1.0872 | -1.9984 | 0.7500 | 0.9112 | -460.0458 | -390.6981 | 1.4415 | 0.4495 |
| 0.8545 | 0.4186 | 800 | 0.8065 | -1.8138 | -2.6104 | 0.7560 | 0.7966 | -521.2473 | -463.3611 | 1.7931 | 0.9682 |
| 0.7903 | 0.4710 | 900 | 0.8040 | -0.8990 | -1.7305 | 0.7579 | 0.8316 | -433.2554 | -371.8721 | -0.4830 | -1.1343 |
| 0.7805 | 0.5233 | 1000 | 0.7874 | -1.5900 | -2.5776 | 0.7738 | 0.9876 | -517.9631 | -440.9751 | 1.2055 | 0.1112 |
| 0.7927 | 0.5756 | 1100 | 0.7853 | -1.4465 | -2.4111 | 0.7579 | 0.9646 | -501.3155 | -426.6308 | 0.3121 | -0.7289 |
| 0.7714 | 0.6279 | 1200 | 0.7814 | -1.2916 | -2.3130 | 0.7679 | 1.0213 | -491.5005 | -411.1409 | 0.3216 | -0.8372 |
| 0.7514 | 0.6803 | 1300 | 0.7838 | -1.2233 | -2.2459 | 0.7718 | 1.0226 | -484.7902 | -404.3044 | 0.5839 | -0.7223 |
| 0.7356 | 0.7326 | 1400 | 0.7767 | -1.4917 | -2.5388 | 0.7698 | 1.0471 | -514.0866 | -431.1516 | 1.2245 | -0.1178 |
| 0.7475 | 0.7849 | 1500 | 0.7756 | -1.3568 | -2.3583 | 0.7639 | 1.0016 | -496.0364 | -417.6552 | 1.0529 | -0.2127 |
| 0.7625 | 0.8373 | 1600 | 0.7751 | -1.2270 | -2.2379 | 0.7778 | 1.0108 | -483.9888 | -404.6796 | 0.7870 | -0.4206 |
| 0.7493 | 0.8896 | 1700 | 0.7748 | -1.3789 | -2.4259 | 0.7718 | 1.0470 | -502.7920 | -419.8630 | 1.2325 | -0.0891 |
| 0.7604 | 0.9419 | 1800 | 0.7743 | -1.3323 | -2.3703 | 0.7718 | 1.0381 | -497.2375 | -415.2034 | 1.0742 | -0.2202 |
| 0.7654 | 0.9942 | 1900 | 0.7743 | -1.3347 | -2.3725 | 0.7718 | 1.0378 | -497.4510 | -415.4467 | 1.0747 | -0.2184 |
Framework versions
- Transformers 4.44.2
- Pytorch 2.2.1+cu118
- Datasets 2.14.7
- Tokenizers 0.19.1