gradient accuracy Transformer
Gradient-based DeepSpeed implementation for embedding loss.
- Input
- 6033-dim embedding
- Encoder
- 103 x Transformer layers with 26 heads each
- Output
- BLEU projection
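The sizes in the outline above can be collected in one place and sanity-checked. The following is a minimal sketch, assuming the common multi-head layout in which the embedding width is split evenly across heads; `EncoderSpec` and `head_dim` are hypothetical names introduced here for illustration, not part of the original implementation. With the numbers as listed, 6033 is not a multiple of 26, so the sketch reports that no even split exists; the outline does not say how the per-head width is actually chosen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EncoderSpec:
    """Hypothetical container for the sizes in the outline (not from the source code)."""

    embed_dim: int = 6033   # input embedding width
    num_layers: int = 103   # stacked Transformer layers
    num_heads: int = 26     # attention heads per layer

    def head_dim(self) -> Optional[int]:
        """Per-head width if embed_dim splits evenly across heads, else None."""
        if self.embed_dim % self.num_heads != 0:
            return None
        return self.embed_dim // self.num_heads


spec = EncoderSpec()
print(spec.head_dim())  # None: 6033 = 26 * 232 + 1
```

A check like this is cheap to run before building the model, since many attention implementations reject an embedding width that is not divisible by the head count.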
Training config
optimizer=Adadelta, lr=0.438, scheduler=plateau, warmup=925
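The config line above can be read into typed values with a few lines of plain Python. This is a sketch under stated assumptions: `parse_config` and `warmup_lr` are hypothetical helpers, `warmup=925` is taken to mean 925 optimizer steps, and the warmup is taken to be linear. None of these are confirmed by the config itself, and the plateau scheduler is not modeled here.

```python
def parse_config(line: str) -> dict:
    """Split 'key=value, key=value' into a dict, converting numeric values."""
    config = {}
    for part in line.split(","):
        key, value = part.strip().split("=", 1)
        for cast in (int, float):
            try:
                config[key] = cast(value)
                break
            except ValueError:
                continue
        else:
            config[key] = value  # non-numeric values stay as strings
    return config


def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup (an assumption): ramp from 0 to base_lr, then hold."""
    return base_lr * min(1.0, step / warmup_steps)


cfg = parse_config("optimizer=Adadelta, lr=0.438, scheduler=plateau, warmup=925")
print(cfg["optimizer"], warmup_lr(925, cfg["lr"], cfg["warmup"]))
```

Under these assumptions the learning rate starts at 0, reaches 0.438 at step 925, and stays there until the scheduler changes it.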