- SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces model accuracy; to overcome this, prior work uses dense models during fine-tuning. SLoPe improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% of pretraining iterations, without adding significant overhead to pretraining or inference. In addition, SLoPe uses a double-pruned backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures to enable an accelerated sparse backward pass. SLoPe accelerates the training and inference of models with billions of parameters by up to 1.14× and 1.34× respectively (OPT-33B and OPT-66B) while reducing their memory usage to as little as 0.77× and 0.51× for training and inference respectively. 4 authors · May 25, 2024
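  A minimal sketch of the double-pruned idea, assuming a 2:4 (N:M) pattern; the helper name `nm_prune`, the dense masked matmuls, and the tiny shapes are illustrative only, since the paper targets hardware sparse kernels rather than masking:

  ```python
  # Sketch: prune the weight for the forward pass, prune its transpose again for the
  # backward pass, and add a lazy low-rank adapter late in pretraining (illustrative only).
  import torch

  def nm_prune(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
      """Keep the n largest-magnitude entries in every group of m along the last dim."""
      rows, cols = w.shape
      groups = w.abs().reshape(rows, cols // m, m)
      idx = groups.topk(n, dim=-1).indices
      mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0).reshape(rows, cols)
      return w * mask

  torch.manual_seed(0)
  W = torch.randn(8, 16)
  W_fwd = nm_prune(W)        # N:M-sparse weight used in the forward pass
  W_bwd = nm_prune(W.t())    # transposed weight pruned again for the backward pass

  # Lazy low-rank adapter (in SLoPe, added only near the end of pretraining).
  r = 2
  L, R = torch.randn(8, r) * 0.01, torch.randn(r, 16) * 0.01

  x = torch.randn(16)
  y = W_fwd @ x + L @ (R @ x)   # sparse matmul plus a cheap low-rank correction
  print(y.shape)
  ```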
- IMF slope derived from a pure probabilistic model The stellar initial mass function (IMF) is of great significance for the study of star formation and galactic structure. Observations indicate that the IMF follows a power-law form. This work derives that, when the expected number of stars formed from a spherical molecular cloud is much greater than 1, the slope α of the IMF is related to the exponent n of the radius-density relation r^n of spherically symmetric gas clouds by α = 3/(n+3) (Γ_IMF = n/(n+3)). This conclusion is close to the results of numerical simulations and observations, yet it is derived from a purely probabilistic model, which may point to underlying reasons worth pondering. 1 author · Mar 18
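  As a purely arithmetic illustration of the stated relation (the physical meaning of n follows the paper's radius-density convention and is not assumed here), note that α + Γ_IMF = 1 by construction:

  ```latex
  \alpha = \frac{3}{n+3}, \qquad \Gamma_{\mathrm{IMF}} = \frac{n}{n+3}, \qquad \alpha + \Gamma_{\mathrm{IMF}} = 1 .
  % Example values:
  % n = 1:\quad \alpha = \tfrac{3}{4},\ \Gamma_{\mathrm{IMF}} = \tfrac{1}{4}
  % n = 2:\quad \alpha = \tfrac{3}{5},\ \Gamma_{\mathrm{IMF}} = \tfrac{2}{5}
  % n = 3:\quad \alpha = \tfrac{1}{2},\ \Gamma_{\mathrm{IMF}} = \tfrac{1}{2}
  ```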
- Zapped then Napped? A rapidly quenched remnant leaker candidate with a steep spectroscopic $β_{UV}$ slope at z=8.5 We use NIRSpec MSA spectroscopy and NIRCam photometry to explore the properties of JADES-GS8-RL-1, a rapidly quenched z=8.5 galaxy with a stellar mass of 10^{8.9} M_⊙, a steep blue UV slope, a Balmer break, and no sign of strong emission lines. With β_UV = -2.8 ± 0.2, as measured from the NIRSpec spectrum, JADES-GS8-RL-1 is consistent with negligible dust attenuation and little to no contribution from the nebular continuum, alongside a probable high escape fraction. The β_UV slope measured from photometry varies from -3.0 in the central regions to -2.2 at the outskirts, suggesting possible regional differences in the escape fraction. There are no high-ionisation emission lines, only a tentative 2.9σ detection of [OII]. In photometry, this emission appears to be extended, possibly corresponding to weakly ionised gas expelled during or after the quenching process. JADES-GS8-RL-1 is spatially resolved with a half-light radius of 240 pc and has an exponential, disc-like morphology. It appears to have formed all its stars in a short burst within the past 100 Myr, with a formation time of ≈70 Myr and a quenching time of ≈30 Myr. This quenching would have occurred rapidly, making it a more distant example of the kind of low-mass "mini-quenched" galaxies previously observed at high z. Due to the extremely blue β_UV slope, our best-fit model predicts a high escape fraction f_esc > 10%, consistent with the value derived from the β_UV slope, which, when combined with our extraordinarily low O32 upper limit, suggests that JADES-GS8-RL-1 is a fascinating example of a high-z "remnant leaker" in one of its earliest phases, deep in the epoch of reionisation. 20 authors · Jan 15
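  For context, a minimal sketch of how a UV continuum slope of this kind is typically measured, assuming the standard convention f_λ ∝ λ^β; the wavelengths and fluxes below are made-up placeholders, not the JADES-GS8-RL-1 data:

  ```python
  # Fit beta_UV as the slope of log(flux) vs log(wavelength) (illustrative data only).
  import numpy as np

  wave = np.array([1500., 1800., 2100., 2500.])      # rest-frame wavelengths [Angstrom]
  flux = 1e-20 * (wave / 1500.) ** -2.8              # synthetic spectrum with beta = -2.8

  beta, log_norm = np.polyfit(np.log10(wave), np.log10(flux), 1)
  print(f"beta_UV ≈ {beta:.2f}")                     # recovers ≈ -2.8
  ```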
- Strong Screening Rules for Group-based SLOPE Models Tuning the regularization parameter in penalized regression models is an expensive task, requiring multiple models to be fit along a path of parameters. Strong screening rules drastically reduce computational costs by lowering the dimensionality of the input prior to fitting. We develop strong screening rules for group-based Sorted L-One Penalized Estimation (SLOPE) models: Group SLOPE and Sparse-group SLOPE. The developed rules are applicable to the wider family of group-based OWL models, including OSCAR. Our experiments on both synthetic and real data show that the screening rules significantly accelerate the fitting process. The screening rules make it feasible to apply Group SLOPE and Sparse-group SLOPE to high-dimensional datasets, particularly those encountered in genetics. 2 authors · May 24, 2024
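  To fix notation, a minimal sketch of the sorted-ℓ1 (SLOPE) penalty and its group variant, which the screening rules operate on top of; the λ sequences and index groupings below are arbitrary examples, not taken from the paper:

  ```python
  # Sorted-L1 penalty: sum_i lambda_i * |beta|_(i), with lambda and |beta| both sorted descending.
  import numpy as np

  def slope_penalty(beta: np.ndarray, lam: np.ndarray) -> float:
      """SLOPE penalty on a coefficient vector."""
      return float(np.sort(np.abs(beta))[::-1] @ np.sort(lam)[::-1])

  def group_slope_penalty(beta: np.ndarray, groups: list, lam: np.ndarray) -> float:
      """Sorted-L1 penalty applied to the Euclidean norms of coefficient groups (Group SLOPE flavour)."""
      norms = np.array([np.linalg.norm(beta[g]) for g in groups])
      return slope_penalty(norms, lam)

  beta = np.array([0.5, -2.0, 0.0, 1.0])
  print(slope_penalty(beta, lam=np.array([4.0, 3.0, 2.0, 1.0])))
  print(group_slope_penalty(beta, groups=[np.array([0, 1]), np.array([2, 3])],
                            lam=np.array([2.0, 1.0])))
  ```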
- Novel Quadratic Constraints for Extending LipSDP beyond Slope-Restricted Activations Recently, semidefinite programming (SDP) techniques have shown great promise in providing accurate Lipschitz bounds for neural networks. Specifically, the LipSDP approach (Fazlyab et al., 2019) has received much attention and provides the least conservative Lipschitz upper bounds that can be computed with polynomial time guarantees. However, one main restriction of LipSDP is that its formulation requires the activation functions to be slope-restricted on [0,1], preventing its further use for more general activation functions such as GroupSort, MaxMin, and Householder. One can, for example, rewrite MaxMin activations as residual ReLU networks. However, a direct application of LipSDP to the resultant residual ReLU networks is conservative and even fails to recover the well-known fact that the MaxMin activation is 1-Lipschitz. Our paper bridges this gap and extends LipSDP beyond slope-restricted activation functions. To this end, we provide novel quadratic constraints for GroupSort, MaxMin, and Householder activations by leveraging their underlying properties such as sum preservation. Our proposed analysis is general and provides a unified approach for estimating ℓ_2 and ℓ_∞ Lipschitz bounds for a rich class of neural network architectures, including non-residual and residual neural networks and implicit models, with GroupSort, MaxMin, and Householder activations. Finally, we illustrate the utility of our approach with a variety of experiments and show that our proposed SDPs generate less conservative Lipschitz bounds in comparison to existing approaches. 7 authors · Jan 25, 2024
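  A small numerical illustration of the distinction the abstract draws (not the paper's SDP): ReLU satisfies the elementwise [0,1] slope-restriction quadratic constraint that LipSDP assumes, while MaxMin is sum- and norm-preserving on each pair (hence 1-Lipschitz) without being an elementwise slope-restricted activation; the sample sizes and tolerances below are arbitrary:

  ```python
  # Check the slope-restriction QC for ReLU and the sum/norm preservation of MaxMin on random data.
  import numpy as np

  rng = np.random.default_rng(0)
  x, y = rng.normal(size=(2, 1000))

  # Slope-restriction QC for a scalar activation phi slope-restricted on [0,1]:
  # (phi(x) - phi(y)) * ((x - y) - (phi(x) - phi(y))) >= 0 for all x, y.
  relu = lambda t: np.maximum(t, 0.0)
  d_in, d_out = x - y, relu(x) - relu(y)
  print("ReLU QC holds:", bool(np.all(d_out * (d_in - d_out) >= -1e-12)))

  # MaxMin acts on pairs: (a, b) -> (max(a, b), min(a, b)); it preserves sums and l2 norms.
  a, b = rng.normal(size=(2, 1000))
  u, v = np.maximum(a, b), np.minimum(a, b)
  print("sum preserved:", np.allclose(a + b, u + v))
  print("l2 norm preserved:", np.allclose(a**2 + b**2, u**2 + v**2))
  ```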