More Dutch BERT explorations: since we used scalar weighting, we can see which layers are used per task. As is common when finetuning pretrained models, we freeze the encoder weights for the first epoch, to prevent large softmax gradients from 'destroying' the encoder. 1/4
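A minimal sketch of the first-epoch freeze described above, assuming a PyTorch-style setup (this is not the code behind these tweets; `encoder`, `head`, and `loader` are hypothetical placeholders): the pretrained encoder is kept frozen for epoch 0 so that the large gradients from the randomly initialized softmax heads only update the heads themselves.

```python
import torch


def set_requires_grad(module: torch.nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for all parameters of a module."""
    for param in module.parameters():
        param.requires_grad = flag


def train(encoder, head, loader, loss_fn, optimizer, epochs: int) -> None:
    for epoch in range(epochs):
        # Freeze the pretrained encoder during the first epoch only;
        # from the second epoch onward the whole model is finetuned.
        set_requires_grad(encoder, epoch > 0)
        for inputs, targets in loader:
            hidden = encoder(inputs)
            loss = loss_fn(head(hidden), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```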
The final, finetuned models use every layer to some degree. The interesting patterns here seem to be: for lemmatization, almost all layers are used equally; for the other tasks, layer use increases towards the later layers. 4/4 pic.twitter.com/LqoAQDaYXM
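For readers unfamiliar with scalar weighting: a minimal sketch of the general idea (ELMo-style scalar mixing), not the exact implementation used for these models. Each task head learns one scalar per encoder layer; after finetuning, the softmax of these scalars shows how much each layer contributes per task, which is what the plots summarize.

```python
import torch


class ScalarMix(torch.nn.Module):
    """Combine per-layer hidden states with learned, softmax-normalized scalars."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = torch.nn.Parameter(torch.zeros(num_layers))

    def forward(self, layers: torch.Tensor) -> torch.Tensor:
        # layers: (num_layers, batch, seq_len, hidden)
        weights = torch.softmax(self.scalars, dim=0)
        return torch.einsum("l,lbsh->bsh", weights, layers)

    def layer_weights(self) -> torch.Tensor:
        """Inspect which layers the finetuned task head actually relies on."""
        return torch.softmax(self.scalars, dim=0).detach()


# Example: a 12-layer encoder with hidden size 768, one mix per task head.
mix = ScalarMix(num_layers=12)
layers = torch.randn(12, 2, 8, 768)  # dummy per-layer hidden states
mixed = mix(layers)                  # (2, 8, 768), fed to the task softmax
print(mix.layer_weights())           # per-layer usage, as in the plots
```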