ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations Paper • 2505.02819 • Published May 5 • 26
Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning Paper • 2508.04581 • Published Aug 6 • 5
view article Article Sparse Mixture of Experts Language Model from Scratch: Extending makeMoE with Expert Capacity By AviSoori1x • Mar 18, 2024 • 13