distily_bench_gpt2_activation_loss_b

This student model was distilled from the teacher model gpt2. The training dataset is not specified.

The Distily library was used for this distillation.

It achieves the following results on the evaluation set (the values correspond to step 9000 in the Eval-Phase Metrics):

  • eval_enwikippl: 225.9773
  • eval_frwikippl: 1391.1320
  • eval_zhwikippl: 821.2236
  • eval_loss: 19.6630
  • eval_runtime: 17.2806
  • eval_samples_per_second: 57.868
  • eval_steps_per_second: 7.234
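
The `*ppl` metrics above are perplexities. As a reminder of what that quantity measures, here is a minimal sketch, assuming the standard definition (the exponential of the mean per-token negative log-likelihood). The helper name `perplexity` and the toy log-probabilities are illustrative only and are not taken from Distily's evaluation code.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy input: a model that assigns probability 1/4 to each of 10 tokens.
# Its perplexity is (up to floating-point error) 4, i.e. the model is as
# uncertain as a uniform choice among 4 tokens.
toy_log_probs = [math.log(0.25)] * 10
print(perplexity(toy_log_probs))
```

Lower is better: a perplexity of 225.9773 on English Wikipedia means the student is, on average, about as uncertain as a uniform choice among roughly 226 tokens.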

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=2.0, loss_fn=ce, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  • train_embeddings: True
  • learning_rate: 4e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: constant
  • num_epochs: 1.0
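
The `distillation_objective` entry lists three loss components with weights 1 (logits, `kl`), 2.0 (hidden states, `ce`), and 0 (attention). The sketch below illustrates one way such an objective could be combined, assuming the components are summed with their weights and that `kl` means KL(teacher || student) over the vocabulary distribution. Both are assumptions, and the function names are hypothetical rather than Distily's API. The hidden-state loss is left as a placeholder number because its exact definition is not given here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token's vocabulary distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Weights copied from the distillation_objective entry above.
WEIGHTS = {"logits": 1.0, "hs": 2.0, "attn": 0.0}

def combined_loss(component_losses, weights=WEIGHTS):
    """Weighted sum of component losses; zero-weight components are skipped."""
    return sum(w * component_losses[name] for name, w in weights.items() if w != 0)

# Toy example with a 3-token vocabulary.
logits_loss = kl_divergence([2.0, 0.5, -1.0], [1.5, 0.7, -0.5])
hs_loss = 0.3  # placeholder value standing in for the hidden-state loss
print(combined_loss({"logits": logits_loss, "hs": hs_loss}))
```

With the attention weight set to 0, that component contributes nothing, so only the logits and hidden-state terms shape the gradient in this configuration.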

Resource Usage

Peak GPU Memory: 8.0903 GB

Eval-Phase Metrics

step          epoch   enwikippl   frwikippl   loss     runtime  samples_per_second  steps_per_second  zhwikippl
teacher eval  -       30.2086     57.2728     -        -        -                   -                 18.1784
0             0       55429.6875  57698.8047  24.5150  17.2943  57.823              7.228             56988.9141
1000          0.0808  713.7677    4453.7666   20.3910  17.3531  57.627              7.203             17866.8926
2000          0.1616  521.2028    3308.0386   20.2010  17.3798  57.538              7.192             2471.2515
3000          0.2424  433.2541    2722.2993   20.1000  17.3672  57.58               7.197             1283.4985
4000          0.3232  387.5081    2569.3728   20.0170  17.3651  57.587              7.198             1167.0867
5000          0.4040  332.2302    2197.1006   19.9310  17.283   57.86               7.233             1141.8051
6000          0.4848  292.5944    1835.8154   19.8590  17.2939  57.824              7.228             905.3102
7000          0.5657  266.3748    1648.5508   19.7820  17.3184  57.742              7.218             844.8045
8000          0.6465  244.8321    1513.9550   19.7310  17.3028  57.794              7.224             1150.9904
9000          0.7273  225.9773    1391.1320   19.6630  17.2806  57.868              7.234             821.2236
10000         0.8081  209.6788    1266.0754   19.6040  17.3446  57.655              7.207             718.9499
11000         0.8889  196.7588    1248.5234   19.5620  17.3611  57.6                7.2               611.5998
12000         0.9697  179.4194    1137.2484   19.5120  17.3767  57.548              7.194             572.3267
12375         1.0     175.7241    1080.9574   19.4920  17.3076  57.778              7.222             584.9987
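
The epoch and throughput columns can be cross-checked with simple arithmetic. The sketch below is illustrative: the helper names are made up, and the training-sample estimate assumes a single device with no gradient accumulation, neither of which is stated in this card.

```python
TOTAL_STEPS = 12375     # final step in the eval-phase metrics
TRAIN_BATCH_SIZE = 8    # train_batch_size from the hyperparameters

def epoch_fraction(step, total_steps=TOTAL_STEPS):
    """Fraction of the single training epoch completed at a given step."""
    return round(step / total_steps, 4)

def eval_counts(runtime_s, samples_per_s, steps_per_s):
    """Approximate number of eval samples and eval steps from throughput."""
    return round(runtime_s * samples_per_s), round(runtime_s * steps_per_s)

print(epoch_fraction(9000))                  # 0.7273, matching the epoch column
print(eval_counts(17.2806, 57.868, 7.234))   # (1000, 125)
print(TOTAL_STEPS * TRAIN_BATCH_SIZE)        # 99000 samples, under the assumption above
```

The step-9000 row therefore implies roughly 1000 evaluation samples processed in 125 batches, which is consistent with eval_batch_size: 8.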

Framework versions

  • Distily 0.2.0
  • Transformers 4.44.0
  • PyTorch 2.3.0
  • Datasets 2.21.0
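
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch using only the standard library; the distribution names (in particular `distily` and `torch`) are assumptions about how the packages are published, and `compare_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this card, keyed by assumed distribution name.
PINNED = {
    "distily": "0.2.0",
    "transformers": "4.44.0",
    "torch": "2.3.0",
    "datasets": "2.21.0",
}

def compare_versions(pinned=PINNED):
    """Return {name: (wanted, installed)}; installed is None if the package is missing."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report

for name, (wanted, installed) in compare_versions().items():
    print(f"{name}: wanted {wanted}, installed {installed}")
```

Installed PyTorch builds often carry a local suffix such as `+cu121`, so an exact string match against `2.3.0` may be stricter than necessary.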

Model size: 0.1B params (Safetensors); tensor type: BF16