T5: Tips for finetuning on crossword clues (clue => answer)

As a baseline for a research project, I am trying to finetune T5 on a large crossword clue set (130,000 clues), where the source is a Clue, and the target is an Answer.

  • I am using T5ForConditionalGeneration and the finetune.py script (examples/seq2seq). I started with T5-small.
  • My source and target files have one example per line (`<Clue>\n` in the source file, `<Answer>\n` on the corresponding line of the target file).
  • I started with `from_pretrained("t5-small")` for both the model and the tokenizer.
  • I didn’t add any tokens to the vocabulary.
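
For reference, my data preparation is roughly the following (file names follow the parallel-file convention the seq2seq scripts expect; the clue/answer pairs here are just illustrative):

```python
import tempfile
from pathlib import Path

# Illustrative (clue, answer) pairs standing in for the real 130k-clue set.
pairs = [
    ("Feline pet", "CAT"),
    ("Opposite of day", "NIGHT"),
    ("Capital of France", "PARIS"),
]

out_dir = Path(tempfile.mkdtemp())

# finetune.py reads parallel files (e.g. train.source / train.target):
# one example per line, aligned by line number.
with open(out_dir / "train.source", "w") as src_f, open(out_dir / "train.target", "w") as tgt_f:
    for clue, answer in pairs:
        src_f.write(clue + "\n")
        tgt_f.write(answer + "\n")
```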

The initial run gave me only gibberish (long strings of entirely non-English output), so I am trying an even simpler task: can T5 learn to select the first word of the input sentence? I.e., I’ve modified the inputs and outputs to be something like
source: This is a clue with some normal language
target: This

Where again each entry is on its own line.
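
Generating this toy dataset is trivial, which is why I picked it as a sanity check; for concreteness, this is the kind of transformation I applied:

```python
# Build the toy "copy the first word" task from any list of sentences:
# the source is the full sentence, the target is just its first word.
sentences = [
    "This is a clue with some normal language",
    "Another perfectly ordinary sentence",
]

sources = list(sentences)
targets = [s.split()[0] for s in sentences]

for src, tgt in zip(sources, targets):
    print(f"source: {src}")
    print(f"target: {tgt}")
```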

I observed (under the same training regime as above, with T5-small) that, after 300 epochs, the model gives outputs that look like
<first word> <long string of gibberish>
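
One thing I am wondering is whether my targets need an explicit end-of-sequence token: I have seen reports that, with some versions of the tokenizer/scripts, `</s>` was not appended to T5 targets automatically, which could explain a model that produces the right first token but never learns to stop. If that is the issue, a quick preprocessing pass would look like this (assuming `</s>` is the correct EOS string for this tokenizer):

```python
def append_eos(line: str, eos_token: str = "</s>") -> str:
    """Append T5's EOS token to a target line, if it's not already there."""
    line = line.rstrip("\n")
    if not line.endswith(eos_token):
        line = f"{line} {eos_token}"
    return line

# Lines that already end with the EOS token are left untouched.
targets = ["This", "CAT </s>"]
fixed = [append_eos(t) for t in targets]
print(fixed)  # ['This </s>', 'CAT </s>']
```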

I wonder if anyone has ideas about what might be going wrong here.

I also have an implementation question:
Is there a way to get the finetune.py script to print validation results at every epoch so that I can see how the model is learning (qualitatively) over time?
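
To be concrete, the behavior I am after is roughly the following per-epoch check, sketched here with a placeholder in place of the real tokenizer + `model.generate` + decode pipeline:

```python
def qualitative_check(generate, val_sources, n_samples=3):
    """Print predictions for a few validation examples after each epoch."""
    predictions = []
    for src in val_sources[:n_samples]:
        pred = generate(src)
        predictions.append(pred)
        print(f"source: {src!r} -> prediction: {pred!r}")
    return predictions

# Placeholder standing in for the finetuned model; the real version would
# tokenize `src`, call model.generate, and decode the output ids.
def fake_generate(src):
    return src.split()[0]

val_sources = ["This is a clue", "Another example here"]
preds = qualitative_check(fake_generate, val_sources)
```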

I filed this bug for the gibberish outputs I am observing.