Great work on this, and thanks for the detailed write-up. In our experience, this approach has worked really well for larger-scale multi-node training. We've seen up to a 3x improvement in training speed on 32B models.