More Dutch BERT explorations: since we used scalar weighting, we can see which layers are used per task. As is common when finetuning pretrained models, we freeze the encoder weights for the first epoch, so that large softmax gradients don't 'destroy' the encoder. 1/4
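(A rough illustration of the scalar weighting and the first-epoch freeze mentioned above. This is a minimal sketch of an ELMo-style scalar mix over encoder layers; the names `ScalarMix`, `encoder`, and `set_encoder_trainable` are hypothetical, not taken from the thread.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalarMix(nn.Module):
    """Learned scalar weighting over the hidden states of all encoder layers."""
    def __init__(self, num_layers: int):
        super().__init__()
        # one learnable scalar per layer, softmax-normalised in forward()
        self.scalars = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states):
        # layer_states: list of (batch, seq_len, hidden) tensors, one per layer
        weights = F.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_states))
        return self.gamma * mixed

# freeze the (pretrained) encoder for the first epoch so that large
# softmax/classifier gradients cannot 'destroy' its weights
def set_encoder_trainable(encoder: nn.Module, trainable: bool) -> None:
    for p in encoder.parameters():
        p.requires_grad = trainable
```

After the first epoch one would call `set_encoder_trainable(encoder, True)` and continue finetuning end to end.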
However, for morphological and POS tagging, it seems that the ML BERT classifiers rely mostly on the initial/middle layers, whereas the BERTje classifiers rely more uniformly on all layers. 3/4
The final, finetuned models use each layer almost equally. The interesting patterns here seem to be: for lemmatization, almost all layers are used equally; for the other tasks, usage increases per layer. 4/4 pic.twitter.com/LqoAQDaYXM
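(To inspect which layers a finetuned classifier relies on, the softmax-normalised scalars can simply be read off per task. A small sketch, assuming the hypothetical `ScalarMix` instance `scalar_mix` from the snippet above:)

```python
# print the learned per-layer weights of a finetuned model
with torch.no_grad():
    layer_weights = F.softmax(scalar_mix.scalars, dim=0)
for i, w in enumerate(layer_weights.tolist()):
    print(f"layer {i:2d}: {w:.3f}")
```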