More Dutch BERT explorations: since we used scalar weighting, we can see which layers are used per task. As is common when finetuning pretrained models, we freeze the encoder weights for the first epoch to prevent large softmax gradients from 'destroying' the encoder. 1/4
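For concreteness, a minimal sketch of what such a setup might look like in PyTorch (the module name `ScalarMix`, the `num_layers` parameter, and the freezing helper are assumptions for illustration, not the actual code behind the thread):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Combine all encoder layers with learned, softmax-normalized scalar weights."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(1))               # overall scaling

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, hidden) tensors, one per encoder layer
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed

def set_encoder_trainable(encoder, trainable: bool):
    # Freeze the encoder for the first epoch so large gradients from the freshly
    # initialized task heads do not disturb the pretrained weights.
    for param in encoder.parameters():
        param.requires_grad = trainable
```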
This can give hints about where information is represented in the initial encoder. Interestingly, the information seems to be distributed quite differently between ML BERT and BERTje. In both, the last layers are used most for dependencies and the initial layers for lemmatization. 2/4 pic.twitter.com/KJKaVYklei
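Per-task layer usage like this can be read off directly from the normalized scalar weights, e.g. (assuming the hypothetical `ScalarMix` sketch above):

```python
# Inspect which layers a task head relies on while the encoder is still frozen.
with torch.no_grad():
    layer_weights = torch.softmax(scalar_mix.scalars, dim=0)
print({f"layer_{i}": round(w.item(), 3) for i, w in enumerate(layer_weights)})
```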
However, for morphological tagging and POS tagging, the classifiers for ML BERT seem to rely most on the initial/middle layers, whereas for BERTje they rely more uniformly on all layers. 3/4
The final, finetuned models use each layer almost equally. The interesting patterns here seem to be: for lemmatization, almost all layers are used equally; for the other tasks, usage increases per layer. 4/4 pic.twitter.com/LqoAQDaYXM