dpo_40k_abla_one_cat_both

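This model is a fine-tuned version of /p/scratch/taco-vlm/xiao4/models/Qwen2.5-VL-7B-Instruct on the dpo_ablation_one_cat_both dataset. The evaluation-set metrics at the last logged checkpoint (step 600, epoch 0.9648), taken from the training results table, are:

Loss: 0.4926
Rewards/chosen: -0.5464
Rewards/rejected: -1.2752
Rewards/accuracies: 0.7700
Rewards/margins: 0.7288
Logps/chosen: -36.4127
Logps/rejected: -48.7677
Logits/chosen: 0.3380
Logits/rejected: 0.3330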

Model description

More information needed


The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 2
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 4
gradient_accumulation_steps: 8
total_train_batch_size: 64
total_eval_batch_size: 4
optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1.0
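
The batch-size totals and the learning-rate schedule listed above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the total step count of 622 is an assumption inferred from the logged step/epoch ratio rather than a value stated in this card, and the schedule is the common "linear warmup, then cosine decay to zero" form, which may differ in detail from what the training framework actually applied.

```python
import math


def effective_batch_size(per_device, num_devices, grad_accum_steps):
    """Samples consumed per optimizer step across all devices."""
    return per_device * num_devices * grad_accum_steps


def lr_at_step(step, total_steps, peak_lr=5e-6, warmup_ratio=0.1):
    """Linear warmup followed by cosine decay to zero (assumed schedule shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values taken from the hyperparameter list above.
train_total = effective_batch_size(per_device=2, num_devices=4, grad_accum_steps=8)
eval_total = effective_batch_size(per_device=1, num_devices=4, grad_accum_steps=1)

# ASSUMPTION: roughly 622 optimizer steps in the single epoch (not stated in the card).
ASSUMED_TOTAL_STEPS = 622

if __name__ == "__main__":
    print("total_train_batch_size:", train_total)
    print("total_eval_batch_size:", eval_total)
    for s in (0, 63, 300, 622):
        print(f"step {s}: lr = {lr_at_step(s, ASSUMED_TOTAL_STEPS):.3e}")
```

Under these assumptions the learning rate climbs to its 5e-06 peak after about 10% of the steps and then decays smoothly toward zero by the end of the epoch.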

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/chosen	Logps/rejected	Logits/chosen	Logits/rejected
0.6899	0.0804	50	0.6893	-0.0073	-0.0155	0.5850	0.0082	-31.0218	-36.1705	0.5865	0.5867
0.6718	0.1608	100	0.6649	-0.0592	-0.1219	0.6650	0.0627	-31.5413	-37.2349	0.5690	0.5871
0.6385	0.2412	150	0.6239	-0.1638	-0.3294	0.7100	0.1656	-32.5863	-39.3094	0.5662	0.5646
0.5641	0.3216	200	0.5847	-0.2708	-0.5575	0.7450	0.2867	-33.6567	-41.5902	0.5242	0.5252
0.5387	0.4020	250	0.5526	-0.3354	-0.7521	0.7400	0.4168	-34.3023	-43.5367	0.4843	0.4783
0.5469	0.4824	300	0.5320	-0.3738	-0.8901	0.7500	0.5164	-34.6866	-44.9168	0.4345	0.4390
0.4983	0.5628	350	0.5195	-0.4765	-1.0702	0.7750	0.5937	-35.7137	-46.7178	0.3969	0.3958
0.4760	0.6432	400	0.5069	-0.5246	-1.1857	0.7700	0.6611	-36.1952	-47.8728	0.3678	0.3619
0.4890	0.7236	450	0.5003	-0.5211	-1.2136	0.7700	0.6925	-36.1599	-48.1518	0.3441	0.3489
0.4826	0.8040	500	0.4943	-0.5300	-1.2462	0.7700	0.7162	-36.2489	-48.4776	0.3410	0.3310
0.4790	0.8844	550	0.4944	-0.5438	-1.2674	0.7700	0.7236	-36.3868	-48.6898	0.3345	0.3348
0.4933	0.9648	600	0.4926	-0.5464	-1.2752	0.7700	0.7288	-36.4127	-48.7677	0.3380	0.3330
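
As a sanity check on the table, the Rewards/margins column should equal Rewards/chosen minus Rewards/rejected, up to rounding of the logged four-decimal values. The sketch below checks this on three rows copied from the table and also estimates the training-set size from the step/epoch ratio. The variable and function names are hypothetical, and the size estimate assumes that every optimizer step consumes exactly 64 preference pairs.

```python
# (step, epoch, rewards_chosen, rewards_rejected, reported_margin), copied from the table.
rows = [
    (50, 0.0804, -0.0073, -0.0155, 0.0082),
    (300, 0.4824, -0.3738, -0.8901, 0.5164),
    (600, 0.9648, -0.5464, -1.2752, 0.7288),
]


def margin_gap(chosen, rejected, reported):
    """Absolute difference between the recomputed and the reported margin."""
    return abs((chosen - rejected) - reported)


# Largest disagreement across the sampled rows; rounding alone should keep it tiny.
max_gap = max(margin_gap(c, r, m) for _, _, c, r, m in rows)

# Rough dataset-size estimate from the last logged row.
# ASSUMPTION: each optimizer step consumes total_train_batch_size = 64 pairs.
last_step, last_epoch = rows[-1][0], rows[-1][1]
steps_per_epoch = last_step / last_epoch
approx_train_examples = steps_per_epoch * 64

if __name__ == "__main__":
    print(f"max margin gap: {max_gap:.4f}")
    print(f"steps per epoch: {steps_per_epoch:.1f}")
    print(f"approx. training pairs: {approx_train_examples:.0f}")
```

The estimate lands close to 40,000 pairs, which is consistent with the "40k" in the model name, though the card itself does not state the dataset size.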
