Update README.md
README.md
@@ -42,7 +42,7 @@ In addition to this, we noticed that Mistral Large models seemed much more sensi
 
 
 We hypothesize this is primarily due to the particularly narrow and low variance weight distributions typical of Mistral derived models regardless of their scale.
-In the end, we settled on 2e-6 with an effective batch size of 64 (and a packed tokens batch size of 8192;
+In the end, we settled on 2e-6 with an effective batch size of 64 (and a packed tokens batch size of 8192; effectively ~500,000 tokens per batch).
 
 We also trained with a weight decay of 0.01 to help further stabilize the loss trajectory and mitigate overfitting.
 
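As a rough sketch of how those hyperparameters compose (variable names and the micro-batch / gradient-accumulation split are assumptions; only the totals 2e-6, 64, 8192, and 0.01 come from the README text above):

```python
# Hypothetical sketch of how the reported hyperparameters compose.
# Only the totals (64, 8192, 2e-6, 0.01) are taken from the README;
# the micro-batch / accumulation split below is an assumption.

micro_batch_size = 8        # assumed per-device packed-sequence count
grad_accum_steps = 8        # assumed; any split whose product is 64 would match
effective_batch_size = micro_batch_size * grad_accum_steps  # 64, as reported

packed_tokens_per_sequence = 8192
tokens_per_optimizer_step = effective_batch_size * packed_tokens_per_sequence
print(tokens_per_optimizer_step)  # 524288, i.e. the "~500,000 tokens per batch" figure

# Learning rate and weight decay reported above; the optimizer choice itself
# (e.g. AdamW) is not specified in this diff.
optimizer_config = {"lr": 2e-6, "weight_decay": 0.01}
```

In other words, 64 packed sequences of 8192 tokens each give 524,288 tokens per optimizer step, which is where the ~500,000 figure in the added line comes from.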
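The claim about narrow, low-variance weight distributions in Mistral-derived models could be spot-checked with a short script along these lines (hypothetical; the checkpoint name and the use of Hugging Face transformers are assumptions, not something this commit specifies):

```python
# Hypothetical spot-check of the "narrow, low-variance weight distribution" claim.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # assumed base checkpoint, for illustration only
    torch_dtype=torch.bfloat16,
)

# Print the standard deviation of each 2-D weight matrix (skip norms and biases).
for name, param in model.named_parameters():
    if param.dim() == 2:
        print(f"{name}: std={param.float().std().item():.5f}")
```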