Nikos Aletras (@nikaletras):

If you missed Huiyin Xue's poster yesterday on HashFormers, you can find the video/poster/slides here: underline.io/events/342/ses…

Huiyin Xue (@HuiyinXue):

MHE attention requires only a negligible number of additional parameters (3nd, where n is the number of attention heads and d the size of the head embeddings) compared to single-head attention, while MHA requires (3n^2 − 3n)d^2 − 3nd additional parameters.
[4/k]
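
To put those formulas in perspective, here is a minimal Python sketch that simply plugs example values into the two expressions quoted above; the choice of n = 12 heads and d = 64 (BERT-base-like) is an illustrative assumption, not a figure from the thread or the paper.

def mhe_extra_params(n: int, d: int) -> int:
    # Additional parameters of MHE over single-head attention: 3nd
    return 3 * n * d

def mha_extra_params(n: int, d: int) -> int:
    # Additional parameters of vanilla MHA: (3n^2 - 3n)d^2 - 3nd
    return (3 * n**2 - 3 * n) * d**2 - 3 * n * d

n, d = 12, 64  # assumed BERT-base-like setting: 12 heads, 64-dim head embeddings
print(f"MHE extra parameters: {mhe_extra_params(n, d):,}")  # 2,304
print(f"MHA extra parameters: {mha_extra_params(n, d):,}")  # 1,619,712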

Huiyin Xue (@HuiyinXue):

MHE is substantially more memory efficient than alternative attention mechanisms, while achieving a high predictive-performance retention ratio relative to vanilla MHA on several downstream tasks.
[3/k]
