For language modeling, do you think many attention heads per layer and many layers are necessary for near-SotA results? What do you believe versus what do you think we have scientific proof for? Are you thinking about only one architecture? Are others possible? Are LSTMs dead?
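For concreteness, here is a minimal sketch of the two knobs the question is about, heads per layer and depth, next to an LSTM of comparable width. PyTorch is assumed and the sizes are purely illustrative, not drawn from any particular model in the thread:

```python
# Illustrative only: "many heads, many layers" as hyperparameters,
# alongside an LSTM stack of the same width and depth.
import torch
import torch.nn as nn

d_model, nhead, num_layers = 512, 8, 12   # width, heads per layer, depth (assumed values)
seq_len, batch = 128, 4

# Transformer encoder: nhead attention heads in each of num_layers layers.
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# LSTM baseline with the same hidden size and number of layers.
lstm = nn.LSTM(input_size=d_model, hidden_size=d_model,
               num_layers=num_layers, batch_first=True)

x = torch.randn(batch, seq_len, d_model)
print(transformer(x).shape)   # torch.Size([4, 128, 512])
print(lstm(x)[0].shape)       # torch.Size([4, 128, 512])
```

Note this sketch omits the causal mask a language model would need; it only shows where the head count and layer count enter the configuration.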
-
This deserves so many more likes