T5: Tips for finetuning on crossword clues (clue => answer)

As a baseline for a research project, I am trying to finetune T5 on a large crossword clue set (130,000 clues), where the source is a Clue, and the target is an Answer.

  • I am using T5ForConditionalGeneration and the finetune.py script (examples/seq2seq). I started with T5-small.
  • My source and target files have one example per line (`<Clue>\n` in the source file, `<Answer>\n` on the corresponding line of the target file).
  • I started with `from_pretrained("t5-small")` for both the model and the tokenizer.
  • I didn’t add any tokens to the vocabulary.
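
For reference, my data preparation is roughly the following (file names follow the parallel-file convention the seq2seq scripts expect; the clue/answer pairs here are just illustrative):

```python
import tempfile
from pathlib import Path

# Illustrative (clue, answer) pairs standing in for the real 130k-clue set.
pairs = [
    ("Feline pet", "CAT"),
    ("Opposite of day", "NIGHT"),
    ("Capital of France", "PARIS"),
]

out_dir = Path(tempfile.mkdtemp())

# finetune.py reads parallel files (e.g. train.source / train.target):
# one example per line, aligned by line number.
with open(out_dir / "train.source", "w") as src_f, open(out_dir / "train.target", "w") as tgt_f:
    for clue, answer in pairs:
        src_f.write(clue + "\n")
        tgt_f.write(answer + "\n")
```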

The initial run gave me only gibberish (long strings of entirely non-English output), so I am trying an even simpler task: can T5 learn to select the first word of the input sentence? I.e., I’ve modified the inputs and outputs to be something like
source: This is a clue with some normal language
target: This

Where again each entry is on its own line.
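
Generating this toy dataset is trivial, which is why I picked it as a sanity check; for concreteness, this is the kind of transformation I applied:

```python
# Build the toy "copy the first word" task from any list of sentences:
# the source is the full sentence, the target is just its first word.
sentences = [
    "This is a clue with some normal language",
    "Another perfectly ordinary sentence",
]

sources = list(sentences)
targets = [s.split()[0] for s in sentences]

for src, tgt in zip(sources, targets):
    print(f"source: {src}")
    print(f"target: {tgt}")
```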

I observed (under the same training regime as above, with T5-small) that, after 300 epochs, the model gives outputs that look like
<first word> <long string of gibberish>
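
One thing I am wondering is whether my targets need an explicit end-of-sequence token: I have seen reports that, with some versions of the tokenizer/scripts, `</s>` was not appended to T5 targets automatically, which could explain a model that produces the right first token but never learns to stop. If that is the issue, a quick preprocessing pass would look like this (assuming `</s>` is the correct EOS string for this tokenizer):

```python
def append_eos(line: str, eos_token: str = "</s>") -> str:
    """Append T5's EOS token to a target line, if it's not already there."""
    line = line.rstrip("\n")
    if not line.endswith(eos_token):
        line = f"{line} {eos_token}"
    return line

# Lines that already end with the EOS token are left untouched.
targets = ["This", "CAT </s>"]
fixed = [append_eos(t) for t in targets]
print(fixed)  # ['This </s>', 'CAT </s>']
```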

I wonder if anyone has ideas about what might be going wrong here.

I also have an implementation question:
Is there a way to get the finetune.py script to print validation results at every epoch so that I can see how the model is learning (qualitatively) over time?
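
To be concrete, the behavior I am after is roughly the following per-epoch check, sketched here with a placeholder in place of the real tokenizer + `model.generate` + decode pipeline:

```python
def qualitative_check(generate, val_sources, n_samples=3):
    """Print predictions for a few validation examples after each epoch."""
    predictions = []
    for src in val_sources[:n_samples]:
        pred = generate(src)
        predictions.append(pred)
        print(f"source: {src!r} -> prediction: {pred!r}")
    return predictions

# Placeholder standing in for the finetuned model; the real version would
# tokenize `src`, call model.generate, and decode the output ids.
def fake_generate(src):
    return src.split()[0]

val_sources = ["This is a clue", "Another example here"]
preds = qualitative_check(fake_generate, val_sources)
```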

I filed this bug for the gibberish outputs I am observing.