I find the Boom layer a bit hard to motivate. It has the same (theoretical) computation cost as a stack of N residual layers, and the latter should outperform it.
Sorry for the delayed reply, agreed re: theoretically better. A stack of N residuals (N=4) converged more slowly in early training than the Boom layer so I went that direction. Maybe longer running experiments would show differently but my heuristic is early progress is good.
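For readers following along, the Boom layer being discussed can be sketched roughly as below. This is a minimal NumPy reading of the commonly described form, not the author's code; the GELU choice and the chunk-and-sum shapes are assumptions:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def boom(x, W, b, n_chunks=4):
    # Up-project H -> n_chunks * H, apply GELU, then sum the chunks
    # back down to H instead of using a second weight matrix.
    h = gelu(x @ W.T + b)                                  # (batch, n_chunks * H)
    batch, hidden = x.shape
    return h.reshape(batch, n_chunks, hidden).sum(axis=1)  # (batch, H)

rng = np.random.default_rng(0)
H, B = 8, 2
W = rng.standard_normal((4 * H, H)) * 0.1
b = np.zeros(4 * H)
x = rng.standard_normal((B, H))
out = boom(x, W, b)
print(out.shape)  # (2, 8)
```

As I understand it, the up-projection is the same H -> 4H multiply a standard feedforward block starts with; the summing step is what removes the 4H -> H down-projection matrix and its parameters.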
- 2 more replies
New conversation
1. Did you / do you plan to perform ablation tests w/ Boom layer? 2. What's with the concatenated memory M? I don't really see how that's used in the computation graph?
- I performed some amount of analysis regarding Boom versus no Boom versus a traditional feedforward layer. None that I'd link to, as the results are usually conflated with other changes. If I had extra compute I'd isolate the various factors. - I realize I need to be clearer on the memory aspect
- 2 more replies
New conversation
To check my understanding, your claim is: rare words' subwords are easier to predict, since the first subword is drawn from a set smaller than the full word vocabulary and teacher forcing makes the rest easy. Though the number of possible multi-subword sequences is larger, only certain paths are trained because of the long tail.
+1. Wordpieces aim to equalize entropy across tokens. That means tokens with high entropy are broken apart, including into suffixes and prefixes or compositional fragments. This is especially pronounced for wordpieces versus whole words, as the latter make no attempt to equalize entropy.
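The "smaller first-subword set" part of the claim can be illustrated with a toy unigram count. The corpus and the subword splits here are hypothetical, purely to make the arithmetic visible; real wordpiece vocabularies are learned, and teacher-forced conditional probabilities, which a unigram model can't show, make the later pieces cheaper still:

```python
import math
from collections import Counter

def surprisal(tokens, t):
    # -log2 p(t) under a unigram model of the token stream
    counts = Counter(tokens)
    total = sum(counts.values())
    return -math.log2(counts[t] / total)

corpus = ("the cat felt unhappiness the dog felt unkindness "
          "the cat felt happiness").split()

# Hypothetical wordpiece-style splits: rare words share frequent fragments.
splits = {
    "unhappiness": ["un", "happi", "ness"],
    "unkindness": ["un", "kind", "ness"],
    "happiness": ["happi", "ness"],
}
subword_tokens = [p for w in corpus for p in splits.get(w, [w])]

# Predicting the whole rare word in one shot vs just its first subword:
whole = surprisal(corpus, "unhappiness")   # rare word, count 1 of 12
first = surprisal(subword_tokens, "un")    # shared prefix, count 2 of 17
print(f"{whole:.2f} bits vs {first:.2f} bits")  # 3.58 bits vs 3.09 bits
```

The rare word costs more bits as a single event than its first fragment does, because the fragment is shared across several words.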
- 4 more replies
New conversation
Is it robust to the random seed? Last time I looked there were complaints that the random seed is a hyperparameter.
It is robust to the random seed across the many experiments and model variants I ran. Not all got the same number, but all fell in a similar range. Similar numbers for different numbers of layers too (i.e. 3 vs 4 layers) with entirely different seeds, if that helps reaffirm it for you too =]
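For what it's worth, that kind of seed-robustness check is cheap to script: rerun the same configuration under several seeds and report the range rather than a single number. A minimal sketch with a stand-in "training run" (the metric, noise level, and seed count are all hypothetical):

```python
import random
import statistics

def train_toy_model(seed, steps=2000):
    # Stand-in for a full training run: the "final loss" is a true value
    # plus seed-dependent noise (entirely made-up numbers).
    rng = random.Random(seed)
    noise = statistics.fmean(rng.gauss(0, 0.5) for _ in range(steps))
    return 1.08 + noise

results = [train_toy_model(seed) for seed in range(5)]
spread = max(results) - min(results)
print(f"mean={statistics.fmean(results):.3f} spread={spread:.3f}")
```

Reporting the mean and spread over seeds, rather than a single run, is exactly what addresses the "seed as hyperparameter" complaint.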
End of conversation
New conversation
When you explain the overparametrization, you refer to Figure 3 to visualize the example, but it is in fact a plot.
Fixed! Labels seem to randomly flock from figure to figure as my copy paste grows ever more error prone. Thank you :)
End of conversation