new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Nov 21

The implications of stochastic gas torques for asymmetric binaries in the LISA band

Gravitational waves from asymmetric mass-ratio black-hole binaries carry unique information about their astrophysical environment. For instance, the Laser Interferometer Space Antenna (LISA) could potentially measure the amplitude and slope of gas torques in binaries embedded in the accretion disks of Active Galactic Nuclei, helping differentiate competing accretion disk models. However, this relies on simplified analytic models, which do not account for the stochastic variability of torques seen in hydrodynamic simulations. In this work, we use hydrodynamic simulations to create gravitational waveforms for extreme and intermediate mass-ratio inspirals in the LISA band. We then analyze these simulated waveforms using simpler templates that assume analytic torques, without stochastic time variability. By performing realistic Bayesian parameter estimation, we find no bias at 90% confidence in the binary parameters; however, estimates of accretion disk parameters, such as torque amplitude and slope, may be biased. Typically, the posterior distribution is centered around the average value of the torques, but when stochastic variability is large, the posterior can indicate no torques, even though they are present in the simulation. Our results suggest that while simplified analytic torque models work well for estimating binary parameters, caution is needed when using them to infer properties of the accretion disk. This work moves towards a more realistic assessment of one of the LISA science objectives, i.e., probing the properties of the astrophysical environments of black holes.

  • 5 authors
·
Feb 14

The NANOGrav 15-year Data Set: Observations and Timing of 68 Millisecond Pulsars

We present observations and timing analyses of 68 millisecond pulsars (MSPs) comprising the 15-year data set of the North American Nanohertz Observatory for Gravitational Waves (NANOGrav). NANOGrav is a pulsar timing array (PTA) experiment that is sensitive to low-frequency gravitational waves. This is NANOGrav's fifth public data release, including both "narrowband" and "wideband" time-of-arrival (TOA) measurements and corresponding pulsar timing models. We have added 21 MSPs and extended our timing baselines by three years, now spanning nearly 16 years for some of our sources. The data were collected using the Arecibo Observatory, the Green Bank Telescope, and the Very Large Array between frequencies of 327 MHz and 3 GHz, with most sources observed approximately monthly. A number of notable methodological and procedural changes were made compared to our previous data sets. These improve the overall quality of the TOA data set and are part of the transition to new pulsar timing and PTA analysis software packages. For the first time, our data products are accompanied by a full suite of software to reproduce data reduction, analysis, and results. Our timing models include a variety of newly detected astrometric and binary pulsar parameters, including several significant improvements to pulsar mass constraints. We find that the time series of 23 pulsars contain detectable levels of red noise, 10 of which are new measurements. In this data set, we find evidence for a stochastic gravitational-wave background.

  • 100 authors
·
Jun 28, 2023

The Binary Fraction of Red Supergiants in the Magellanic Clouds

Red supergiants (RSGs), as the descendants of OB-type stars and the progenitors of supernovae, provide crucial insights into the evolution of massive stars, particularly in binary systems. Previous studies show that the binary fraction of RSGs (approx 15% - 40%) is significantly lower than that of their predecessors (approx 50% - 70%). In this work, we investigate the binary fraction of RSGs with the recently selected largest samples of 4695 and 2097 RSGs in the Large Magellanic Cloud (LMC) and Small Magellanic Cloud (SMC), respectively. The binary system with a hot companion (O-, B- and A-type star) is identified by detecting the ultraviolet (UV) excess in the observed spectral energy distribution (SED) ranging from ultraviolet to mid-infrared after subtracting the model SED of RSG since RSGs are very weak in the UV band. It is found that the lower limit of binarity is 30.2% pm 0.7% and 32.2% pm 1% in the LMC and SMC, respectively. If the sample is limited to luminous RSGs with log L/L_{odot} > 4.0, the binary fraction becomes 26.6% pm 1.1% and 26.4% pm 1.7% in the LMC and SMC, respectively. The derived binary fraction is valid in the range of sim 2.3 < log P / [d] < sim 8. Our study suggests that roughly one-third of massive stars host a third companion within sim 30,000 AU. In addition, 15 RSGs are also identified as binary via HST/STIS spectra, and a handful of the binaries identified by the SED fitting are confirmed by their light curve and radial velocity dispersion. The stellar parameters of the companions, i.e. T_{eff}, R, L and log g, are calculated by model fitting.

  • 3 authors
·
Apr 4

Phemenological Modelling of a Group of Eclipsing Binary Stars

Phenomenological modeling of variable stars allows determination of a set of the parameters, which are needed for classification in the "General Catalogue of Variable Stars" and similar catalogs. We apply a recent method NAV ("New Algol Variable") to eclipsing binary stars of different types. Although all periodic functions may be represented as Fourier series with an infinite number of coefficients, this is impossible for a finite number of the observations. Thus one may use a restricted Fourier series, i.e. a trigonometric polynomial (TP) of order s either for fitting the light curve, or to make a periodogram analysis. However, the number of parameters needed drastically increases with decreasing width of minimum. In the NAV algorithm, the special shape of minimum is used, so the number of parameters is limited to 10 (if the period and initial epoch are fixed) or 12 (not fixed). We illustrate the NAV method by application to a recently discovered Algol-type eclipsing variable 2MASS J11080308-6145589 (in the field of previously known variable star RS Car) and compare results to that obtained using the TP fits. For this system, the statistically optimal number of parameters is 44, but the fit is still worse than that of the NAV fit. Application to the system GSC 3692-00624 argues that the NAV fit is better than the TP one even for the case of EW-type stars with much wider eclipses. Model parameters are listed.

  • 3 authors
·
Sep 17, 2015

Phemenological Modeling of Eclipsing Binary Stars

We review the method NAV (New Algol Variable) first introduced in 2012Ap.....55..536A, which uses the locally-dependent shapes of eclipses in an addition to the trigonometric polynomial of the second order (which typically describes the "out-of-eclipse" part of the light curve with effects of reflection, ellipticity and O'Connell). Eclipsing binary stars are believed to show distinct eclipses only if belonging to the EA type. With a decreasing eclipse width, the statistically optimal value of the trigonometric polynomial s (2003ASPC..292..391A) drastically increases from ~2 for elliptic (EL) variables without eclipses, ~6-8 for EW and up to ~30-50 for some EA with narrow eclipses. In this case of large number of parameters, the smoothing curve becomes very noisy and apparent waves (the Gibbs phenomenon) may be seen. The NAV set of the parameters may be used for classification in the GCVS, VSX and similar catalogs. The maximal number of parameters is m=12, which corresponds to s=5, if correcting both the period and the initial epoch. We have applied the method to few stars, also in a case of multi-color photometry (2015JASS...32..127A), when it is possible to use the phenomenological parameters from the NAV fit to estimate physical parameters using statistical dependencies. We conclude that the NAV approximation is better than the TP one even for the case of EW-type stars with much wider eclipses. It may also be used to determine timings (see 2005ASPC..335...37A for a review of methods) or to determine parameters in the case of variable period, using a complete light curve modeling the phase variations. The method is illustrated on 2MASS J11080447-6143290 (EA-type), USNO-B1.0 1265-0306001 and USNO-B1.0 1266-0313413 (EW-type) and compared to various other methods from the literature.

  • 3 authors
·
Feb 12, 2016

Efficient Massive Black Hole Binary parameter estimation for LISA using Sequential Neural Likelihood

The inspiral, merger, and ringdown of Massive Black Hole Binaries (MBHBs) is one the main sources of Gravitational Waves (GWs) for the future Laser Interferometer Space Antenna (LISA), an ESA-led mission in the implementation phase. It is expected that LISA will detect these systems throughout the entire observable universe. Robust and efficient data analysis algorithms are necessary to detect and estimate physical parameters for these systems. In this work, we explore the application of Sequential Neural Likelihood, a simulation-based inference algorithm, to detect and characterize MBHB GW signals in synthetic LISA data. We describe in detail the different elements of the method, their performance and possible alternatives that can be used to enhance the performance. Instead of sampling from the conventional likelihood function, which requires a forward simulation for each evaluation, this method constructs a surrogate likelihood that is ultimately described by a neural network trained from a dataset of simulations of the MBHB signals and noise. One important advantage of this method is that, given that the likelihood is independent of the priors, we can iteratively train models that target specific observations in a fraction of the time and computational cost that other traditional and machine learning-based strategies would require. Because of the iterative nature of the method, we are able to train models to obtain qualitatively similar posteriors with less than 2\% of the simulator calls that Markov Chain Monte Carlo methods would require. We compare these posteriors with those obtained from Markov Chain Monte Carlo techniques and discuss the differences that appear, in particular in relation with the important role that data compression has in the modular implementation of the method that we present. We also discuss different strategies to improve the performance of the algorithms.

  • 2 authors
·
Jun 1, 2024

Modelling the accretion and feedback of supermassive black hole binaries in gas-rich galaxy mergers

We introduce a new model for the accretion and feedback of supermassive black hole (SMBH) binaries to the KETJU code, which enables us to resolve the evolution of SMBH binaries down to separations of tens of Schwarzschild radii in gas-rich galaxy mergers. Our subgrid binary accretion model extends the widely used Bondi--Hoyle--Lyttleton accretion into the binary phase and incorporates preferential mass accretion onto the secondary SMBH, which is motivated by results from small-scale hydrodynamical circumbinary disc simulations. We perform idealised gas-rich disc galaxy merger simulations using pure thermal or pure kinetic active galactic nuclei (AGN) feedback. Our binary accretion model provides more physically motivated SMBH mass ratios, which are one of the key parameters for computing gravitational wave (GW) induced recoil velocities. The merger time-scales of our simulated SMBH binaries are in the range t_{rm merge}{sim} 10--400 Myr. Prograde in-plane equal-mass galaxy mergers lead to the shortest merger time-scales, as they experience the strongest starbursts, with the ensuing high stellar density resulting in a rapid SMBH coalescence. Compared to the thermal AGN feedback, the kinetic AGN feedback predicts longer merger time-scales and results in more core-like stellar profiles, as it is more effective in removing gas from the galaxy centre and quenching star formation. This suggests that the AGN feedback implementation plays a critical role in modelling SMBH coalescences. Our model will be useful for improving the modelling of SMBH mergers in gas-rich galaxies, the prime targets for the upcoming LISA GW observatory.

  • 9 authors
·
Nov 21, 2022

Applying Pre-trained Multilingual BERT in Embeddings for Improved Malicious Prompt Injection Attacks Detection

Large language models (LLMs) are renowned for their exceptional capabilities, and applying to a wide range of applications. However, this widespread use brings significant vulnerabilities. Also, it is well observed that there are huge gap which lies in the need for effective detection and mitigation strategies against malicious prompt injection attacks in large language models, as current approaches may not adequately address the complexity and evolving nature of these vulnerabilities in real-world applications. Therefore, this work focuses the impact of malicious prompt injection attacks which is one of most dangerous vulnerability on real LLMs applications. It examines to apply various BERT (Bidirectional Encoder Representations from Transformers) like multilingual BERT, DistilBert for classifying malicious prompts from legitimate prompts. Also, we observed how tokenizing the prompt texts and generating embeddings using multilingual BERT contributes to improve the performance of various machine learning methods: Gaussian Naive Bayes, Random Forest, Support Vector Machine, and Logistic Regression. The performance of each model is rigorously analyzed with various parameters to improve the binary classification to discover malicious prompts. Multilingual BERT approach to embed the prompts significantly improved and outperformed the existing works and achieves an outstanding accuracy of 96.55% by Logistic regression. Additionally, we investigated the incorrect predictions of the model to gain insights into its limitations. The findings can guide researchers in tuning various BERT for finding the most suitable model for diverse LLMs vulnerabilities.

  • 4 authors
·
Sep 20, 2024

Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers

Medical image segmentation is a critical task in clinical workflows, particularly for the detection and delineation of pathological regions. While convolutional architectures like U-Net have become standard for such tasks, their limited receptive field restricts global context modeling. Recent efforts integrating transformers have addressed this, but often result in deep, computationally expensive models unsuitable for real-time use. In this work, we present a novel end-to-end lightweight architecture designed specifically for real-time binary medical image segmentation. Our model combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways to preserve spatial detail while capturing contextual information. Unlike existing designs such as Swin Transformer or U-Net, our architecture is significantly shallower and competitively efficient. To improve the encoder's ability to learn meaningful features without relying on large amounts of labeled data, we first train it using Barlow Twins, a self-supervised learning method that helps the model focus on important patterns by reducing unnecessary repetition in the learned features. After this pretraining, we fine-tune the entire model for our specific task. Experiments on benchmark binary segmentation tasks demonstrate that our model achieves competitive accuracy with substantially reduced parameter count and faster inference, positioning it as a practical alternative for deployment in real-time and resource-limited clinical environments. The code for our method is available at Github repository: https://github.com/mkianih/Barlow-Swin.

  • 5 authors
·
Sep 8

Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models

Prompt tuning and adapter tuning have shown great potential in transferring pre-trained vision-language models (VLMs) to various downstream tasks. In this work, we design a new type of tuning method, termed as regularized mask tuning, which masks the network parameters through a learnable selection. Inspired by neural pathways, we argue that the knowledge required by a downstream task already exists in the pre-trained weights but just gets concealed in the upstream pre-training stage. To bring the useful knowledge back into light, we first identify a set of parameters that are important to a given downstream task, then attach a binary mask to each parameter, and finally optimize these masks on the downstream data with the parameters frozen. When updating the mask, we introduce a novel gradient dropout strategy to regularize the parameter selection, in order to prevent the model from forgetting old knowledge and overfitting the downstream data. Experimental results on 11 datasets demonstrate the consistent superiority of our method over previous alternatives. It is noteworthy that we manage to deliver 18.73% performance improvement compared to the zero-shot CLIP via masking an average of only 2.56% parameters. Furthermore, our method is synergistic with most existing parameter-efficient tuning methods and can boost the performance on top of them. Project page can be found here (https://wuw2019.github.io/R-AMT/).

  • 9 authors
·
Jul 27, 2023

Revision of the Phenomenological Characteristics of the Algol-Type Stars Using the NAV Algorithm

Phenomenological characteristics of the sample of the Algol-type stars are revised using a recently developed NAV ("New Algol Variable") algorithm (2012Ap.....55..536A, 2012arXiv 1212.6707A) and compared to that obtained using common methods of Trigonometric Polynomial Fit (TP) or local Algebraic Polynomial (A) fit of a fixed or (alternately) statistically optimal degree (1994OAP.....7...49A, 2003ASPC..292..391A). The computer program NAV is introduced, which allows to determine the best fit with 7 "linear" and 5 "non-linear" parameters and their error estimates. The number of parameters is much smaller than for the TP fit (typically 20-40, depending on the width of the eclipse, and is much smaller (5-20) for the W UMa and beta Lyrae - type stars. This causes more smooth approximation taking into account the reflection and ellipsoidal effects (TP2) and generally different shapes of the primary and secondary eclipses. An application of the method to two-color CCD photometry to the recently discovered eclipsing variable 2MASS J18024395 + 4003309 = VSX J180243.9 +400331 (2015JASS...32..101A) allowed to make estimates of the physical parameters of the binary system based on the phenomenological parameters of the light curve. The phenomenological parameters of the light curves were determined for the sample of newly discovered EA and EW - type stars (VSX J223429.3+552903, VSX J223421.4+553013, VSX J223416.2+553424, US-NO-B1.0 1347-0483658, UCAC3-191-085589, VSX J180755.6+074711= UCAC3 196-166827). Despite we have used original observations published by the discoverers, the accuracy estimates of the period using the NAV method are typically better than the original ones.

  • 3 authors
·
Nov 30, 2015

BiPer: Binary Neural Networks using a Periodic Function

Quantized neural networks employ reduced precision representations for both weights and activations. This quantization process significantly reduces the memory requirements and computational complexity of the network. Binary Neural Networks (BNNs) are the extreme quantization case, representing values with just one bit. Since the sign function is typically used to map real values to binary values, smooth approximations are introduced to mimic the gradients during error backpropagation. Thus, the mismatch between the forward and backward models corrupts the direction of the gradient, causing training inconsistency problems and performance degradation. In contrast to current BNN approaches, we propose to employ a binary periodic (BiPer) function during binarization. Specifically, we use a square wave for the forward pass to obtain the binary values and employ the trigonometric sine function with the same period of the square wave as a differentiable surrogate during the backward pass. We demonstrate that this approach can control the quantization error by using the frequency of the periodic function and improves network performance. Extensive experiments validate the effectiveness of BiPer in benchmark datasets and network architectures, with improvements of up to 1% and 0.69% with respect to state-of-the-art methods in the classification task over CIFAR-10 and ImageNet, respectively. Our code is publicly available at https://github.com/edmav4/BiPer.

  • 4 authors
·
Apr 1, 2024

Binary and Ternary Natural Language Generation

Ternary and binary neural networks enable multiplication-free computation and promise multiple orders of magnitude efficiency gains over full-precision networks if implemented on specialized hardware. However, since both the parameter and the output space are highly discretized, such networks have proven very difficult to optimize. The difficulties are compounded for the class of transformer text generation models due to the sensitivity of the attention operation to quantization and the noise-compounding effects of autoregressive decoding in the high-cardinality output space. We approach the problem with a mix of statistics-based quantization for the weights and elastic quantization of the activations and demonstrate the first ternary and binary transformer models on the downstream tasks of summarization and machine translation. Our ternary BART base achieves an R1 score of 41 on the CNN/DailyMail benchmark, which is merely 3.9 points behind the full model while being 16x more efficient. Our binary model, while less accurate, achieves a highly non-trivial score of 35.6. For machine translation, we achieved BLEU scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full precision mBART model score of 26.8. We also compare our approach in the 8-bit activation setting, where our ternary and even binary weight models can match or outperform the best existing 8-bit weight models in the literature. Our code and models are available at: https://github.com/facebookresearch/Ternary_Binary_Transformer

  • 5 authors
·
Jun 2, 2023

Assemblage: Automatic Binary Dataset Construction for Machine Learning

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpuses of malicious binaries, obtaining high-quality corpuses of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpuses (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpuses of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage can be downloaded from https://assemblage-dataset.net

  • 8 authors
·
May 7, 2024

Machine learning-driven Anomaly Detection and Forecasting for Euclid Space Telescope Operations

State-of-the-art space science missions increasingly rely on automation due to spacecraft complexity and the costs of human oversight. The high volume of data, including scientific and telemetry data, makes manual inspection challenging. Machine learning offers significant potential to meet these demands. The Euclid space telescope, in its survey phase since February 2024, exemplifies this shift. Euclid's success depends on accurate monitoring and interpretation of housekeeping telemetry and science-derived data. Thousands of telemetry parameters, monitored as time series, may or may not impact the quality of scientific data. These parameters have complex interdependencies, often due to physical relationships (e.g., proximity of temperature sensors). Optimising science operations requires careful anomaly detection and identification of hidden parameter states. Moreover, understanding the interactions between known anomalies and physical quantities is crucial yet complex, as related parameters may display anomalies with varied timing and intensity. We address these challenges by analysing temperature anomalies in Euclid's telemetry from February to August 2024, focusing on eleven temperature parameters and 35 covariates. We use a predictive XGBoost model to forecast temperatures based on historical values, detecting anomalies as deviations from predictions. A second XGBoost model predicts anomalies from covariates, capturing their relationships to temperature anomalies. We identify the top three anomalies per parameter and analyse their interactions with covariates using SHAP (Shapley Additive Explanations), enabling rapid, automated analysis of complex parameter relationships. Our method demonstrates how machine learning can enhance telemetry monitoring, offering scalable solutions for other missions with similar data challenges.

  • 6 authors
·
Nov 8, 2024

Improve Machine Learning carbon footprint using Nvidia GPU and Mixed Precision training for classification models -- Part I

This is the 1st part of the dissertation for my master degree and compares the power consumption using the default floating point (32bit) and Nvidia mixed precision (16bit and 32bit) while training a classification ML model. A custom PC with specific hardware was built to perform the experiments, and different ML hyper-parameters, such as batch size, neurons, and epochs, were chosen to build Deep Neural Networks (DNN). Additionally, various software was used during the experiments to collect the power consumption data in Watts from the Graphics Processing Unit (GPU), Central Processing Unit (CPU), Random Access Memory (RAM) and manually from a wattmeter connected to the wall. A benchmarking test with default hyper parameter values for the DNN was used as a reference, while the experiments used a combination of different settings. The results were recorded in Excel, and descriptive statistics were chosen to calculate the mean between the groups and compare them using graphs and tables. The outcome was positive when using mixed precision combined with specific hyper-parameters. Compared to the benchmarking, the optimisation for the classification reduced the power consumption between 7 and 11 Watts. Similarly, the carbon footprint is reduced because the calculation uses the same power consumption data. Still, a consideration is required when configuring hyper-parameters because it can negatively affect hardware performance. However, this research required inferential statistics, specifically ANOVA and T-test, to compare the relationship between the means. Furthermore, tests indicated no statistical significance of the relationship between the benchmarking and experiments. However, a more extensive implementation with a cluster of GPUs can increase the sample size significantly, as it is an essential factor and can change the outcome of the statistical analysis.

  • 1 authors
·
Sep 12, 2024

BinaryDM: Towards Accurate Binarization of Diffusion Model

With the advancement of diffusion models (DMs) and the substantially increased computational requirements, quantization emerges as a practical solution to obtain compact and efficient low-bit DMs. However, the highly discrete representation leads to severe accuracy degradation, hindering the quantization of diffusion models to ultra-low bit-widths. In this paper, we propose BinaryDM, a novel accurate quantization-aware training approach to push the weights of diffusion models towards the limit of 1-bit. Firstly, we present a Learnable Multi-basis Binarizer (LMB) to recover the representations generated by the binarized DM, which improves the information in details of representations crucial to the DM. Secondly, a Low-rank Representation Mimicking (LRM) is applied to enhance the binarization-aware optimization of the DM, alleviating the optimization direction ambiguity caused by fine-grained alignment. Moreover, a progressive initialization strategy is applied to training DMs to avoid convergence difficulties. Comprehensive experiments demonstrate that BinaryDM achieves significant accuracy and efficiency gains compared to SOTA quantization methods of DMs under ultra-low bit-widths. As the first binarization method for diffusion models, BinaryDM achieves impressive 16.0 times FLOPs and 27.1 times storage savings with 1-bit weight and 4-bit activation, showcasing its substantial advantages and potential for deploying DMs on resource-limited scenarios.

  • 9 authors
·
Apr 8, 2024

Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation

Recent advances in LLM-based decompilers have been shown effective to convert low-level binaries into human-readable source code. However, there still lacks a comprehensive benchmark that provides large-scale binary-source function pairs, which is critical for advancing the LLM decompilation technology. Creating accurate binary-source mappings incurs severe issues caused by complex compilation settings and widespread function inlining that obscure the correspondence between binaries and their original source code. Previous efforts have either relied on used contest-style benchmarks, synthetic binary-source mappings that diverge significantly from the mappings in real world, or partially matched binaries with only code lines or variable names, compromising the effectiveness of analyzing the binary functionality. To alleviate these issues, we introduce Decompile-Bench, the first open-source dataset comprising two million binary-source function pairs condensed from 100 million collected function pairs, i.e., 450GB of binaries compiled from permissively licensed GitHub projects. For the evaluation purposes, we also developed a benchmark Decompile-Bench-Eval including manually crafted binaries from the well-established HumanEval and MBPP, alongside the compiled GitHub repositories released after 2025 to mitigate data leakage issues. We further explore commonly-used evaluation metrics to provide a thorough assessment of the studied LLM decompilers and find that fine-tuning with Decompile-Bench causes a 20% improvement over previous benchmarks in terms of the re-executability rate. Our code and data has been released in HuggingFace and Github. https://github.com/albertan017/LLM4Decompile

  • 9 authors
·
May 18

Measuring the Intrinsic Dimension of Objective Landscapes

Many recently trained neural networks employ large numbers of parameters to achieve good performance. One may intuitively use the number of parameters required as a rough gauge of the difficulty of a problem. But how accurate are such notions? How many parameters are really needed? In this paper we attempt to answer this question by training networks not in their native parameter space, but instead in a smaller, randomly oriented subspace. We slowly increase the dimension of this subspace, note at which dimension solutions first appear, and define this to be the intrinsic dimension of the objective landscape. The approach is simple to implement, computationally tractable, and produces several suggestive conclusions. Many problems have smaller intrinsic dimensions than one might suspect, and the intrinsic dimension for a given dataset varies little across a family of models with vastly different sizes. This latter result has the profound implication that once a parameter space is large enough to solve a problem, extra parameters serve directly to increase the dimensionality of the solution manifold. Intrinsic dimension allows some quantitative comparison of problem difficulty across supervised, reinforcement, and other types of learning where we conclude, for example, that solving the inverted pendulum problem is 100 times easier than classifying digits from MNIST, and playing Atari Pong from pixels is about as hard as classifying CIFAR-10. In addition to providing new cartography of the objective landscapes wandered by parameterized models, the method is a simple technique for constructively obtaining an upper bound on the minimum description length of a solution. A byproduct of this construction is a simple approach for compressing networks, in some cases by more than 100 times.

  • 4 authors
·
Apr 24, 2018

BiBench: Benchmarking and Analyzing Network Binarization

Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization algorithms to diverse tasks, architectures, and hardware in realistic scenarios is still not straightforward. Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in the actual production and define evaluation tracks and metrics for a comprehensive and fair investigation. Then, we evaluate and analyze a series of milestone binarization algorithms that function at the operator level and with extensive influence. Our benchmark reveals that 1) the binarized operator has a crucial impact on the performance and deployability of binarized networks; 2) the accuracy of binarization varies significantly across different learning tasks and neural architectures; 3) binarization has demonstrated promising efficiency potential on edge devices despite the limited hardware support. The results and analysis also lead to a promising paradigm for accurate and efficient binarization. We believe that BiBench will contribute to the broader adoption of binarization and serve as a foundation for future research. The code for our BiBench is released https://github.com/htqin/BiBench .

  • 8 authors
·
Jan 26, 2023

BiPFT: Binary Pre-trained Foundation Transformer with Low-rank Estimation of Binarization Residual Polynomials

Pretrained foundation models offer substantial benefits for a wide range of downstream tasks, which can be one of the most potential techniques to access artificial general intelligence. However, scaling up foundation transformers for maximal task-agnostic knowledge has brought about computational challenges, especially on resource-limited devices such as mobiles. This work proposes the first Binary Pretrained Foundation Transformer (BiPFT) for natural language understanding (NLU) tasks, which remarkably saves 56 times operations and 28 times memory. In contrast to previous task-specific binary transformers, BiPFT exhibits a substantial enhancement in the learning capabilities of binary neural networks (BNNs), promoting BNNs into the era of pre-training. Benefiting from extensive pretraining data, we further propose a data-driven binarization method. Specifically, we first analyze the binarization error in self-attention operations and derive the polynomials of binarization error. To simulate full-precision self-attention, we define binarization error as binarization residual polynomials, and then introduce low-rank estimators to model these polynomials. Extensive experiments validate the effectiveness of BiPFTs, surpassing task-specific baseline by 15.4% average performance on the GLUE benchmark. BiPFT also demonstrates improved robustness to hyperparameter changes, improved optimization efficiency, and reduced reliance on downstream distillation, which consequently generalize on various NLU tasks and simplify the downstream pipeline of BNNs. Our code and pretrained models are publicly available at https://github.com/Xingrun-Xing/BiPFT.

  • 7 authors
·
Dec 14, 2023

A new sample of massive B-type contact binary candidates from the OGLE survey of the Magellanic Clouds

Massive contact binaries (CBs) are key to understanding close-binary evolution and stellar mergers, yet their study has been limited by the scarcity of observed systems, particularly of B-type binaries expected to dominate this class. We bridge this gap by mining a large sample of massive CB candidates from the OGLE-IV database, increasing their known numbers in the Magellanic Clouds by nearly an order of magnitude. Using main-sequence colour-magnitude limits, an observationally informed period-luminosity-colour relation for CBs, and a high morph-parameter cut (cgeq0.7), we identified 68 O- and B-type binaries that exhibit smooth, sinusoidal light curves with nearly equal eclipse depths. We then isolated a bona fide sample of 37 CB candidates (28 in the LMC and 9 in the SMC) that match theoretical colour-magnitude and period distributions derived from an extensive grid of MESA binary models. The bona fide sample, dominated by B-type systems with Papprox0.6-1 d, agrees with the predicted population and may contain many qapprox1 binaries, as expected from models showing mass equalization preceding temperature equalization during nuclear-timescale contact. Synthetic PHOEBE light curves of contact and near-contact phases of MESA models reveal a degeneracy between these configurations, suggesting possible misidentifications among these systems. Spectroscopic follow-up is required to test these predictions and refine the evolutionary framework of massive CBs.

  • 5 authors
·
Oct 21, 2024

A Flexible Parametric Modelling Framework for Survival Analysis

We introduce a general, flexible, parametric survival modelling framework which encompasses key shapes of hazard function (constant, increasing, decreasing, up-then-down, down-then-up), various common survival distributions (log-logistic, Burr type XII, Weibull, Gompertz), and includes defective distributions (i.e., cure models). This generality is achieved using four basic distributional parameters: two scale-type parameters and two shape parameters. Generalising to covariate dependence, the scale-type regression components correspond to accelerated failure time (AFT) and proportional hazards (PH) models. Therefore, this general formulation unifies the most popular survival models which allows us to consider the practical value of possible modelling choices for survival data. Furthermore, in line with our proposed flexible baseline distribution, we advocate the use of multi-parameter regression in which more than one distributional parameter depends on covariates - rather than the usual convention of having a single covariate-dependent (scale) parameter. While many choices are available, we suggest introducing covariates through just one or other of the two scale parameters, which covers AFT and PH models, in combination with a `power' shape parameter, which allows for more complex non-AFT/non-PH effects, while the other shape parameter remains covariate-independent, and handles automatic selection of the baseline distribution. We explore inferential issues in simulations, both with and without a covariate, with particular focus on evidence concerning the need, or otherwise, to include both AFT and PH parameters. We illustrate the efficacy of our modelling framework by investigating differences between treatment groups using data from a lung cancer study and a melanoma study. Censoring is accommodated throughout.

  • 3 authors
·
Jan 10, 2019

Light Schrödinger Bridge

Despite the recent advances in the field of computational Schr\"odinger Bridges (SB), most existing SB solvers are still heavy-weighted and require complex optimization of several neural networks. It turns out that there is no principal solver which plays the role of simple-yet-effective baseline for SB just like, e.g., k-means method in clustering, logistic regression in classification or Sinkhorn algorithm in discrete optimal transport. We address this issue and propose a novel fast and simple SB solver. Our development is a smart combination of two ideas which recently appeared in the field: (a) parameterization of the Schr\"odinger potentials with sum-exp quadratic functions and (b) viewing the log-Schr\"odinger potentials as the energy functions. We show that combined together these ideas yield a lightweight, simulation-free and theoretically justified SB solver with a simple straightforward optimization objective. As a result, it allows solving SB in moderate dimensions in a matter of minutes on CPU without a painful hyperparameter selection. Our light solver resembles the Gaussian mixture model which is widely used for density estimation. Inspired by this similarity, we also prove an important theoretical result showing that our light solver is a universal approximator of SBs. Furthemore, we conduct the analysis of the generalization error of our light solver. The code for our solver can be found at https://github.com/ngushchin/LightSB

  • 3 authors
·
Oct 2, 2023

Understanding the Impact of Post-Training Quantization on Large Language Models

Large language models (LLMs) are rapidly increasing in size, with the number of parameters becoming a key factor in the success of many commercial models, such as ChatGPT, Claude, and Bard. Even the recently released publicly accessible models for commercial usage, such as Falcon and Llama2, come equipped with billions of parameters. This significant increase in the number of parameters makes deployment and operation very costly. The remarkable progress in the field of quantization for large neural networks in general and LLMs in particular, has made these models more accessible by enabling them to be deployed on consumer-grade GPUs. Quantized models generally demonstrate comparable performance levels to their unquantized base counterparts. Nonetheless, there exists a notable gap in our comprehensive understanding of how these quantized models respond to hyperparameters, such as temperature, max new tokens, and topk, particularly for next word prediction. The present analysis reveals that nf4 and fp4 are equally proficient 4-bit quantization techniques, characterized by similar attributes such as inference speed, memory consumption, and the quality of generated content. the study identifies nf4 as displaying greater resilience to temperature variations in the case of the llama2 series of models at lower temperature, while fp4 and fp4-dq proves to be a more suitable choice for falcon series of models. It is noteworthy that, in general, 4-bit quantized models of varying sizes exhibit higher sensitivity to temperature in the range of 0.5 to 0.8, unlike their unquantized counterparts. Additionally, int8 quantization is associated with significantly slower inference speeds, whereas unquantized bfloat16 models consistently yield the fastest inference speeds across models of all sizes.

  • 1 authors
·
Sep 10, 2023

The Price of Differential Privacy under Continual Observation

We study the accuracy of differentially private mechanisms in the continual release model. A continual release mechanism receives a sensitive dataset as a stream of T inputs and produces, after receiving each input, an accurate output on the obtained inputs. In contrast, a batch algorithm receives the data as one batch and produces a single output. We provide the first strong lower bounds on the error of continual release mechanisms. In particular, for two fundamental problems that are widely studied and used in the batch model, we show that the worst case error of every continual release algorithm is tilde Omega(T^{1/3}) times larger than that of the best batch algorithm. Previous work shows only a polylogarithimic (in T) gap between the worst case error achievable in these two models; further, for many problems, including the summation of binary attributes, the polylogarithmic gap is tight (Dwork et al., 2010; Chan et al., 2010). Our results show that problems closely related to summation -- specifically, those that require selecting the largest of a set of sums -- are fundamentally harder in the continual release model than in the batch model. Our lower bounds assume only that privacy holds for streams fixed in advance (the "nonadaptive" setting). However, we provide matching upper bounds that hold in a model where privacy is required even for adaptively selected streams. This model may be of independent interest.

  • 4 authors
·
Dec 1, 2021

Binary Embedding-based Retrieval at Tencent

Large-scale embedding-based retrieval (EBR) is the cornerstone of search-related industrial applications. Given a user query, the system of EBR aims to identify relevant information from a large corpus of documents that may be tens or hundreds of billions in size. The storage and computation turn out to be expensive and inefficient with massive documents and high concurrent queries, making it difficult to further scale up. To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension. Specifically, we compress the full-precision query and document embeddings, formulated as float vectors in general, into a composition of multiple binary vectors using a lightweight transformation model with residual multilayer perception (MLP) blocks. We can therefore tailor the number of bits for different applications to trade off accuracy loss and cost savings. Importantly, we enable task-agnostic efficient training of the binarization model using a new embedding-to-embedding strategy. We also exploit the compatible training of binary embeddings so that the BEBR engine can support indexing among multiple embedding versions within a unified system. To further realize efficient search, we propose Symmetric Distance Calculation (SDC) to achieve lower response time than Hamming codes. We successfully employed the introduced BEBR to Tencent products, including Sogou, Tencent Video, QQ World, etc. The binarization algorithm can be seamlessly generalized to various tasks with multiple modalities. Extensive experiments on offline benchmarks and online A/B tests demonstrate the efficiency and effectiveness of our method, significantly saving 30%~50% index costs with almost no loss of accuracy at the system level.

  • 10 authors
·
Feb 17, 2023

Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries

Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable source code-like representation. However, reverse engineering is time-consuming, much of which is taken up by labelling the functions with semantic information. While the automated summarisation of decompiled code can help Reverse Engineers understand and analyse binaries, current work mainly focuses on summarising source code, and no suitable dataset exists for this task. In this work, we extend large pre-trained language models of source code to summarise decompiled binary functions. Furthermore, we investigate the impact of input and data properties on the performance of such models. Our approach consists of two main components; the data and the model. We first build CAPYBARA, a dataset of 214K decompiled function-documentation pairs across various compiler optimisations. We extend CAPYBARA further by generating synthetic datasets and deduplicating the data. Next, we fine-tune the CodeT5 base model with CAPYBARA to create BinT5. BinT5 achieves the state-of-the-art BLEU-4 score of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively. This indicates that these models can be extended to decompiled binaries successfully. Finally, we found that the performance of BinT5 is not heavily dependent on the dataset size and compiler optimisation level. We recommend future research to further investigate transferring knowledge when working with less expressive input formats such as stripped binaries.

  • 6 authors
·
Jan 4, 2023

Procedural Generation of Grain Orientations using the Wave Function Collapse Algorithm

Statistics of grain sizes and orientations in metals correlate to the material's mechanical properties. Reproducing representative volume elements for further analysis of deformation and failure in metals, like 316L stainless steel, is particularly important due to their wide use in manufacturing goods today. Two approaches, initially created for video games, were considered for the procedural generation of representative grain microstructures. The first is the Wave Function Collapse (WFC) algorithm, and the second is constraint propagation and probabilistic inference through Markov Junior, a free and open-source software. This study aimed to investigate these two algorithms' effectiveness in using reference electron backscatter diffraction (EBSD) maps and recreating a statistically similar one that could be used in further research. It utilized two stainless steel EBSD maps as references to test both algorithms. First, the WFC algorithm was too constricting and, thus, incapable of producing images that resembled EBSDs. The second, MarkovJunior, was much more effective in creating a Voronoi tessellation that could be used to create an EBSD map in Python. When comparing the results between the reference and the generated EBSD, we discovered that the orientation and volume fractions were extremely similar. With the study, it was concluded that MarkovJunior is an effective machine learning tool that can reproduce representative grain microstructures.

  • 3 authors
·
Nov 20, 2023

How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models

Binary code analysis plays a pivotal role in various software security applications, such as software maintenance, malware detection, software vulnerability discovery, patch analysis, etc. However, unlike source code, understanding binary code is challenging for reverse engineers due to the absence of semantic information. Therefore, automated tools are needed to assist human players in interpreting binary code. In recent years, two groups of technologies have shown promising prospects: (1) Deep learning-based technologies have demonstrated competitive results in tasks related to binary code understanding, furthermore, (2) Large Language Models (LLMs) have been extensively pre-trained at the source-code level for tasks such as code understanding and generation. This makes participants wonder about the ability of LLMs in binary code understanding. In this work, we propose a benchmark to evaluate the effectiveness of LLMs in real-world reverse engineering scenarios. The benchmark covers two key binary code understanding tasks, including function name recovery and binary code summarization. We gain valuable insights into their capabilities and limitations through extensive evaluations of popular LLMs using our benchmark. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis. Our results highlight the great potential of the LLMs in advancing the field of binary code understanding.

  • 9 authors
·
Apr 15, 2024

Beyond Binary Gender Labels: Revealing Gender Biases in LLMs through Gender-Neutral Name Predictions

Name-based gender prediction has traditionally categorized individuals as either female or male based on their names, using a binary classification system. That binary approach can be problematic in the cases of gender-neutral names that do not align with any one gender, among other reasons. Relying solely on binary gender categories without recognizing gender-neutral names can reduce the inclusiveness of gender prediction tasks. We introduce an additional gender category, i.e., "neutral", to study and address potential gender biases in Large Language Models (LLMs). We evaluate the performance of several foundational and large language models in predicting gender based on first names only. Additionally, we investigate the impact of adding birth years to enhance the accuracy of gender prediction, accounting for shifting associations between names and genders over time. Our findings indicate that most LLMs identify male and female names with high accuracy (over 80%) but struggle with gender-neutral names (under 40%), and the accuracy of gender prediction is higher for English-based first names than non-English names. The experimental results show that incorporating the birth year does not improve the overall accuracy of gender prediction, especially for names with evolving gender associations. We recommend using caution when applying LLMs for gender identification in downstream tasks, particularly when dealing with non-binary gender labels.

  • 7 authors
·
Jul 7, 2024

KIC 4150611: A quadruply eclipsing heptuple star system with a g-mode period-spacing pattern Asteroseismic modelling of the g-mode period-spacing pattern

In this work, we aim to estimate the stellar parameters of the primary (Aa) by performing asteroseismic analysis on its period-spacing pattern. We use the C-3PO neural network to perform asteroseismic modelling of the g-mode period-spacing pattern of Aa, discussing the interplay of this information with external constraints from spectroscopy (T_{rm eff} and log(g)) and eclipse modelling (R). To estimate the level of uncertainty due to different frequency extraction and pattern identification processes, we consider four different variations on the period-spacing patterns. To better understand the correlations between and the uncertainty structure of our parameter estimates, we also employed a classical, parameter-based MCMC grid search on four different stellar grids. The best-fitting, externally constrained model to the period-spacing pattern arrives at estimates of the stellar properties for Aa of: M=1.51 pm 0.05 M_odot, X_c =0.43 pm 0.04, R=1.66 pm 0.1 R_odot, f_{rm ov}=0.010, Omega_c=1.58 pm 0.01 d^{-1} with rigid rotation to within the measurement errors, log(T_{rm eff})=3.856 pm 0.008 dex, log(g)=4.18 pm 0.04 dex, and log(L)=0.809 pm 0.005 dex, which agree well with previous measurements from eclipse modelling, spectroscopy, and the Gaia DR3 luminosity. We find that the near-core properties of the best-fitting asteroseismic models are consistent with external constraints from eclipse modelling and spectroscopy. Aa appears to be a typical example of a gamma Dor star, fitting well within existing populations. We find that Aa is quasi-rigidly rotating to within the uncertainties, and note that the asteroseismic age estimate for Aa (1100 pm 100 Myr) is considerably older than the young (35 Myr) age implied by previous isochrone fits to the B binary in the literature. Our MCMC parameter-based grid-search agrees well with our pattern-modelling approach.

  • 10 authors
·
Nov 27, 2024

Automated speech- and text-based classification of neuropsychiatric conditions in a multidiagnostic setting

Speech patterns have been identified as potential diagnostic markers for neuropsychiatric conditions. However, most studies only compare a single clinical group to healthy controls, whereas clinical practice often requires differentiating between multiple potential diagnoses (multiclass settings). To address this, we assembled a dataset of repeated recordings from 420 participants (67 with major depressive disorder, 106 with schizophrenia and 46 with autism, as well as matched controls), and tested the performance of a range of conventional machine learning models and advanced Transformer models on both binary and multiclass classification, based on voice and text features. While binary models performed comparably to previous research (F1 scores between 0.54-0.75 for autism spectrum disorder, ASD; 0.67-0.92 for major depressive disorder, MDD; and 0.71-0.83 for schizophrenia); when differentiating between multiple diagnostic groups performance decreased markedly (F1 scores between 0.35-0.44 for ASD, 0.57-0.75 for MDD, 0.15-0.66 for schizophrenia, and 0.38-0.52 macro F1). Combining voice and text-based models yielded increased performance, suggesting that they capture complementary diagnostic information. Our results indicate that models trained on binary classification may learn to rely on markers of generic differences between clinical and non-clinical populations, or markers of clinical features that overlap across conditions, rather than identifying markers specific to individual conditions. We provide recommendations for future research in the field, suggesting increased focus on developing larger transdiagnostic datasets that include more fine-grained clinical features, and that can support the development of models that better capture the complexity of neuropsychiatric conditions and naturalistic diagnostic assessment.

  • 11 authors
·
Jan 13, 2023

GWKokab: An Implementation to Identify the Properties of Multiple Population of Gravitational Wave Sources

The rapidly increasing sensitivity of gravitational wave detectors is enabling the detection of a growing number of compact binary mergers. These events are crucial for understanding the population properties of compact binaries. However, many previous studies rely on computationally expensive inference frameworks, limiting their scalability. In this work, we present GWKokab, a JAX-based framework that enables modular model building with independent rate for each subpopulation such as BBH, BNS, and NSBH binaries. It provides accelerated inference using the normalizing flow based sampler called flowMC and is also compatible with NumPyro samplers. To validate our framework, we generated two synthetic populations, one comprising spinning eccentric binaries and the other circular binaries using a multi-source model. We then recovered their injected parameters at significantly reduced computational cost and demonstrated that eccentricity distribution can be recovered even in spinning eccentric populations. We also reproduced results from two prior studies: one on non-spinning eccentric populations, and another on the BBH mass distribution using the third Gravitational Wave Transient Catalog (GWTC-3). We anticipate that GWKokab will not only reduce computational costs but also enable more detailed subpopulation analyses such as their mass, spin, eccentricity, and redshift distributions in gravitational wave events, offering deeper insights into compact binary formation and evolution.

  • 3 authors
·
Sep 16

DeepArchitect: Automatically Designing and Training Deep Architectures

In deep learning, performance is strongly affected by the choice of architecture and hyperparameters. While there has been extensive work on automatic hyperparameter optimization for simple spaces, complex spaces such as the space of deep architectures remain largely unexplored. As a result, the choice of architecture is done manually by the human expert through a slow trial and error process guided mainly by intuition. In this paper we describe a framework for automatically designing and training deep models. We propose an extensible and modular language that allows the human expert to compactly represent complex search spaces over architectures and their hyperparameters. The resulting search spaces are tree-structured and therefore easy to traverse. Models can be automatically compiled to computational graphs once values for all hyperparameters have been chosen. We can leverage the structure of the search space to introduce different model search algorithms, such as random search, Monte Carlo tree search (MCTS), and sequential model-based optimization (SMBO). We present experiments comparing the different algorithms on CIFAR-10 and show that MCTS and SMBO outperform random search. In addition, these experiments show that our framework can be used effectively for model discovery, as it is possible to describe expressive search spaces and discover competitive models without much effort from the human expert. Code for our framework and experiments has been made publicly available.

  • 2 authors
·
Apr 27, 2017

A Benchmark Study on Calibration

Deep neural networks are increasingly utilized in various machine learning tasks. However, as these models grow in complexity, they often face calibration issues, despite enhanced prediction accuracy. Many studies have endeavored to improve calibration performance through the use of specific loss functions, data preprocessing and training frameworks. Yet, investigations into calibration properties have been somewhat overlooked. Our study leverages the Neural Architecture Search (NAS) search space, offering an exhaustive model architecture space for thorough calibration properties exploration. We specifically create a model calibration dataset. This dataset evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely employed NATS-Bench search space. Our analysis aims to answer several longstanding questions in the field, using our proposed dataset: (i) Can model calibration be generalized across different datasets? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Additionally, our study bridges an existing gap by exploring calibration within NAS. By providing this dataset, we enable further research into NAS calibration. As far as we are aware, our research represents the first large-scale investigation into calibration properties and the premier study of calibration issues within NAS. The project page can be found at https://www.taolinwei.com/calibration-study

  • 5 authors
·
Aug 22, 2023

Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training

Binarization of neural networks is a dominant paradigm in neural networks compression. The pioneering work BinaryConnect uses Straight Through Estimator (STE) to mimic the gradients of the sign function, but it also causes the crucial inconsistency problem. Most of the previous methods design different estimators instead of STE to mitigate it. However, they ignore the fact that when reducing the estimating error, the gradient stability will decrease concomitantly. These highly divergent gradients will harm the model training and increase the risk of gradient vanishing and gradient exploding. To fully take the gradient stability into consideration, we present a new perspective to the BNNs training, regarding it as the equilibrium between the estimating error and the gradient stability. In this view, we firstly design two indicators to quantitatively demonstrate the equilibrium phenomenon. In addition, in order to balance the estimating error and the gradient stability well, we revise the original straight through estimator and propose a power function based estimator, Rectified Straight Through Estimator (ReSTE for short). Comparing to other estimators, ReSTE is rational and capable of flexibly balancing the estimating error with the gradient stability. Extensive experiments on CIFAR-10 and ImageNet datasets show that ReSTE has excellent performance and surpasses the state-of-the-art methods without any auxiliary modules or losses.

  • 4 authors
·
Aug 13, 2023

Differentially Private Sequential Learning

In a differentially private sequential learning setting, agents introduce endogenous noise into their actions to maintain privacy. Applying this to a standard sequential learning model leads to different outcomes for continuous vs. binary signals. For continuous signals with a nonzero privacy budget, we introduce a novel smoothed randomized response mechanism that adapts noise based on distance to a threshold, unlike traditional randomized response, which applies uniform noise. This enables agents' actions to better reflect both private signals and observed history, accelerating asymptotic learning speed to Theta_{epsilon}(log(n)), compared to Theta(log(n)) in the non-private regime where privacy budget is infinite. Moreover, in the non-private setting, the expected stopping time for the first correct decision and the number of incorrect actions diverge, meaning early agents may make mistakes for an unreasonably long period. In contrast, under a finite privacy budget epsilon in (0,1), both remain finite, highlighting a stark contrast between private and non-private learning. Learning with continuous signals in the private regime is more efficient, as smooth randomized response enhances the log-likelihood ratio over time, improving information aggregation. Conversely, for binary signals, differential privacy noise hinders learning, as agents tend to use a constant randomized response strategy before an information cascade forms, reducing action informativeness and hampering the overall process.

  • 2 authors
·
Feb 26

Learning to Relax: Setting Solver Parameters Across a Sequence of Linear System Instances

Solving a linear system Ax=b is a fundamental scientific computing primitive for which numerous solvers and preconditioners have been developed. These come with parameters whose optimal values depend on the system being solved and are often impossible or too expensive to identify; thus in practice sub-optimal heuristics are used. We consider the common setting in which many related linear systems need to be solved, e.g. during a single numerical simulation. In this scenario, can we sequentially choose parameters that attain a near-optimal overall number of iterations, without extra matrix computations? We answer in the affirmative for Successive Over-Relaxation (SOR), a standard solver whose parameter omega has a strong impact on its runtime. For this method, we prove that a bandit online learning algorithm--using only the number of iterations as feedback--can select parameters for a sequence of instances such that the overall cost approaches that of the best fixed omega as the sequence length increases. Furthermore, when given additional structural information, we show that a contextual bandit method asymptotically achieves the performance of the instance-optimal policy, which selects the best omega for each instance. Our work provides the first learning-theoretic treatment of high-precision linear system solvers and the first end-to-end guarantees for data-driven scientific computing, demonstrating theoretically the potential to speed up numerical methods using well-understood learning algorithms.

  • 4 authors
·
Oct 3, 2023

Parameter Competition Balancing for Model Merging

While fine-tuning pretrained models has become common practice, these models often underperform outside their specific domains. Recently developed model merging techniques enable the direct integration of multiple models, each fine-tuned for distinct tasks, into a single model. This strategy promotes multitasking capabilities without requiring retraining on the original datasets. However, existing methods fall short in addressing potential conflicts and complex correlations between tasks, especially in parameter-level adjustments, posing a challenge in effectively balancing parameter competition across various tasks. This paper introduces an innovative technique named PCB-Merging (Parameter Competition Balancing), a lightweight and training-free technique that adjusts the coefficients of each parameter for effective model merging. PCB-Merging employs intra-balancing to gauge parameter significance within individual tasks and inter-balancing to assess parameter similarities across different tasks. Parameters with low importance scores are dropped, and the remaining ones are rescaled to form the final merged model. We assessed our approach in diverse merging scenarios, including cross-task, cross-domain, and cross-training configurations, as well as out-of-domain generalization. The experimental results reveal that our approach achieves substantial performance enhancements across multiple modalities, domains, model sizes, number of tasks, fine-tuning forms, and large language models, outperforming existing model merging methods. The code is publicly available at: https://github.com/duguodong7/pcb-merging.

  • 11 authors
·
Oct 3, 2024

BiBERT: Accurate Fully Binarized BERT

The large pre-trained BERT has achieved remarkable performance on Natural Language Processing (NLP) tasks but is also computation and memory expensive. As one of the powerful compression approaches, binarization extremely reduces the computation and memory consumption by utilizing 1-bit parameters and bitwise operations. Unfortunately, the full binarization of BERT (i.e., 1-bit weight, embedding, and activation) usually suffer a significant performance drop, and there is rare study addressing this problem. In this paper, with the theoretical justification and empirical analysis, we identify that the severe performance drop can be mainly attributed to the information degradation and optimization direction mismatch respectively in the forward and backward propagation, and propose BiBERT, an accurate fully binarized BERT, to eliminate the performance bottlenecks. Specifically, BiBERT introduces an efficient Bi-Attention structure for maximizing representation information statistically and a Direction-Matching Distillation (DMD) scheme to optimize the full binarized BERT accurately. Extensive experiments show that BiBERT outperforms both the straightforward baseline and existing state-of-the-art quantized BERTs with ultra-low bit activations by convincing margins on the NLP benchmark. As the first fully binarized BERT, our method yields impressive 56.3 times and 31.2 times saving on FLOPs and model size, demonstrating the vast advantages and potential of the fully binarized BERT model in real-world resource-constrained scenarios.

  • 8 authors
·
Mar 12, 2022

Investigating cannibalistic millisecond pulsar binaries using MESA: New constraints from pulsar spin and mass evolution

Compact binary millisecond pulsars (MSPs) with orbital periods lesssim1d are key to understanding binary evolution involving massive neutron stars (NSs). Due to the ablation of the companion by the rapidly spinning pulsar, these systems are also known as spiders and categorized into two main branches: redbacks (RBs; companion mass in the range of 0.1 to 0.5\,\Msun) and black widows (BWs; companion mass lesssim\,0.1\,\Msun). We present models of low- and intermediate-mass X-ray binaries and compare them with observations of Galactic spiders (including the presence or absence of hydrogen lines in their optical spectra), and we constrain and quantify the interaction between the pulsar and the companion. Using MESA, we created the allowed initial parameter space. For the first time in MESA, we also included the detailed evolution of the pulsar spin and modeled the irradiation of the companion by the pulsar wind. Efficient mass accretion onto the NS (at least 70% of the mass transferred is accreted) with an X-ray irradiated disk followed by strong irradiation of the companion can explain most of the properties of the observed spiders. Our RB evolutionary tracks continue to the BW regime, connecting the two branches of spiders. Our models explain the lack of hydrogen in some observed BWs with ultra-light companions. During accretion induced spin up, the mass required to spin up an NS to sub-milliseconds is high enough to collapse it into a black hole. Finally, after analyzing the formation of RB-like spiders with giant companions and orbital periods of several days (huntsmen), we conclude that they are unlikely to produce super-massive NSs (maximum accreted mass lesssim0.5M_{odot}). Cannibalistic MSP binary formation depends heavily on the interplay between accretion onto the pulsar and pulsar wind irradiation.

  • 3 authors
·
Aug 28, 2024

Un-Mixing Test-Time Normalization Statistics: Combatting Label Temporal Correlation

Recent test-time adaptation methods heavily rely on nuanced adjustments of batch normalization (BN) parameters. However, one critical assumption often goes overlooked: that of independently and identically distributed (i.i.d.) test batches with respect to unknown labels. This oversight leads to skewed BN statistics and undermines the reliability of the model under non-i.i.d. scenarios. To tackle this challenge, this paper presents a novel method termed 'Un-Mixing Test-Time Normalization Statistics' (UnMix-TNS). Our method re-calibrates the statistics for each instance within a test batch by mixing it with multiple distinct statistics components, thus inherently simulating the i.i.d. scenario. The core of this method hinges on a distinctive online unmixing procedure that continuously updates these statistics components by incorporating the most similar instances from new test batches. Remarkably generic in its design, UnMix-TNS seamlessly integrates with a wide range of leading test-time adaptation methods and pre-trained architectures equipped with BN layers. Empirical evaluations corroborate the robustness of UnMix-TNS under varied scenarios-ranging from single to continual and mixed domain shifts, particularly excelling with temporally correlated test data and corrupted non-i.i.d. real-world streams. This adaptability is maintained even with very small batch sizes or single instances. Our results highlight UnMix-TNS's capacity to markedly enhance stability and performance across various benchmarks. Our code is publicly available at https://github.com/devavratTomar/unmixtns.

  • 4 authors
·
Jan 16, 2024

Systematic Optimization of Open Source Large Language Models for Mathematical Reasoning

This paper presents a practical investigation into fine-tuning model parameters for mathematical reasoning tasks through experimenting with various configurations including randomness control, reasoning depth, and sampling strategies, careful tuning demonstrates substantial improvements in efficiency as well as performance. A holistically optimized framework is introduced for five state-of-the-art models on mathematical reasoning tasks, exhibiting significant performance boosts while maintaining solution correctness. Through systematic parameter optimization across Qwen2.5-72B, Llama-3.1-70B, DeepSeek-V3, Mixtral-8x22B, and Yi-Lightning, consistent efficiency gains are demonstrated with 100% optimization success rate. The methodology achieves an average 29.4% reduction in computational cost and 23.9% improvement in inference speed across all tested models. This framework systematically searches parameter spaces including temperature (0.1-0.5), reasoning steps (4-12), planning periods (1-4), and nucleus sampling (0.85-0.98), determining optimal configurations through testing on mathematical reasoning benchmarks. Critical findings show that lower temperature regimes (0.1-0.4) and reduced reasoning steps (4-6) consistently enhance efficiency without compromising accuracy. DeepSeek-V3 achieves the highest accuracy at 98%, while Mixtral-8x22B delivers the most cost-effective performance at 361.5 tokens per accurate response. Key contributions include: (1) the first comprehensive optimization study for five diverse SOTA models in mathematical reasoning, (2) a standardized production-oriented parameter optimization framework, (3) discovery of universal optimization trends applicable across model architectures, and (4) production-ready configurations with extensive performance characterization.

  • 6 authors
·
Sep 8

On filter design in deep convolutional neural network

The deep convolutional neural network (DCNN) in computer vision has given promising results. It is widely applied in many areas, from medicine, agriculture, self-driving car, biometric system, and almost all computer vision-based applications. Filters or weights are the critical elements responsible for learning in DCNN. Backpropagation has been the primary learning algorithm for DCNN and provides promising results, but the size and numbers of the filters remain hyper-parameters. Various studies have been done in the last decade on semi-supervised, self-supervised, and unsupervised methods and their properties. The effects of filter initialization, size-shape selection, and the number of filters on learning and optimization have not been investigated in a separate publication to collate all the options. Such attributes are often treated as hyper-parameters and lack mathematical understanding. Computer vision algorithms have many limitations in real-life applications, and understanding the learning process is essential to have some significant improvement. To the best of our knowledge, no separate investigation has been published discussing the filters; this is our primary motivation. This study focuses on arguments for choosing specific physical parameters of filters, initialization, and learning technic over scattered methods. The promising unsupervised approaches have been evaluated. Additionally, the limitations, current challenges, and future scope have been discussed in this paper.

  • 2 authors
·
Oct 28, 2024

Aioli: A Unified Optimization Framework for Language Model Data Mixing

Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity. To understand this inconsistency, we unify existing methods into a standard framework, showing they are equivalent to solving a common optimization problem: minimize average loss subject to a method-specific mixing law -- an implicit assumption on the relationship between loss and mixture proportions. This framework suggests that measuring the fidelity of a method's mixing law can offer insights into its performance. Empirically, we find that existing methods set their mixing law parameters inaccurately, resulting in the inconsistent mixing performance we observe. Using this insight, we derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.27 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.012 test perplexity points.

  • 5 authors
·
Nov 8, 2024 2

Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by this finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model without time-condition that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval. Empirically, RADD is up to 3.5 times faster while achieving similar performance with the strongest baseline. Built upon the new perspective of conditional distributions, we further unify absorbing discrete diffusion and any-order autoregressive models (AO-ARMs), showing that the upper bound on the negative log-likelihood for the diffusion model can be interpreted as an expected negative log-likelihood for AO-ARMs. Further, our RADD models achieve SOTA performance among diffusion models on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale. Our code is available at https://github.com/ML-GSAI/RADD.

  • 7 authors
·
Jun 6, 2024

RegMix: Data Mixture as Regression for Language Model Pre-training

The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance given their respective mixtures. With the fitted regression model, we simulate the top-ranked mixture and use it to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens of different mixtures to fit the regression model and find the optimal mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Further, our method demonstrates superior performance compared to human selection and achieves results that match or surpass DoReMi, while utilizing only 10% of the compute budget. Our experiments also show that (1) Data mixtures significantly impact performance with single-task performance variations of up to 14.6%; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws, and our approach captures the complexity by considering all domains together. Our code is available at https://github.com/sail-sg/regmix.

  • 8 authors
·
Jul 1, 2024 7

DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models

Effective DNA embedding remains crucial in genomic analysis, particularly in scenarios lacking labeled data for model fine-tuning, despite the significant advancements in genome foundation models. A prime example is metagenomics binning, a critical process in microbiome research that aims to group DNA sequences by their species from a complex mixture of DNA sequences derived from potentially thousands of distinct, often uncharacterized species. To fill the lack of effective DNA embedding models, we introduce DNABERT-S, a genome foundation model that specializes in creating species-aware DNA embeddings. To encourage effective embeddings to error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C^2LR) strategy. Empirical results on 18 diverse datasets showed DNABERT-S's remarkable performance. It outperforms the top baseline's performance in 10-shot species classification with just a 2-shot training while doubling the Adjusted Rand Index (ARI) in species clustering and substantially increasing the number of correctly identified species in metagenomics binning. The code, data, and pre-trained model are publicly available at https://github.com/Zhihan1996/DNABERT_S.

  • 8 authors
·
Feb 13, 2024

A Survey of Quantization Methods for Efficient Neural Network Inference

As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.

  • 6 authors
·
Mar 25, 2021

Parameter estimation from the core-bounce phase of rotating core collapse supernovae in real interferometer noise

In this work we propose an analytical model that reproduces the core-bounds phase of gravitational waves (GW) of Rapidly Rotating (RR) from Core Collapse Supernovae (CCSNe), as a function of three parameters, the arrival time tau, the ratio of the kinetic and potential energy beta and a phenomenological parameter alpha related to rotation and equation of state (EOS). To validate the model we use 126 waveforms from the Richers catalog Richers_2017 selected with the criteria of exploring a range of rotation profiles, and involving EOS. To quantify the degree of accuracy of the proposed model, with a particular focus on the rotation parameter beta, we show that the average Fitting Factor (FF) between the simulated waveforms with the templates is 94.4\%. In order to estimate the parameters we propose a frequentist matched filtering approach in real interferometric noise which does not require assigning any priors. We use the Matched Filter (MF) technique, where we inject a bank of templates considering simulated colored Gaussian noise and the real noise of O3L1. For example for A300w6.00\_BHBLP at 10Kpc we obtain a standar deviation of sigma = 3.34times 10^{-3} for simulated colored Gaussian noise and sigma= 1.46times 10^{-2} for real noise. On the other hand, from the asymptotic expansion of the variance we obtain the theoretical minimum error for beta at 10 kpc and optimal orientation. The estimation error in this case is from 10^{-2} to 10^{-3} as beta increases. We show that the results of the estimation error of beta for the 3-parameter space (3D) is consistent with the single-parameter space (1D), which allows us to conclude that beta is decoupled from the others two parameters.

  • 5 authors
·
Apr 3, 2023

SgrA* spin and mass estimates through the detection of multiple extremely large mass-ratio inspirals

We analyze the parameter estimation accuracy that can be achieved for the mass and spin of SgrA*, the SMBH in our Galactic Center, by detecting multiple extremely large mass-ratio inspirals (XMRIs). XMRIs are formed by brown dwarfs (BD) inspiraling into a supermassive black hole (SMBH), thus emitting gravitational waves (GWs) inside the detection band of future space-based detectors such as LISA and TianQin. Theoretical estimates suggest the presence of approximately 10 XMRIs emitting detectable GWs, making them some of the most promising candidates for space-based GW detectors. Our analysis indicates that even if individual sources have low SNRs (approx10), high-precision parameter estimates can still be achieved by detecting multiple sources. In this case, the accuracy of the parameter estimates increases by approximately one to two orders of magnitude, at least. Moreover, by analyzing a small sample of 400 initial conditions for XMRIs formed in the Galactic Center, we estimate that almost 80 % of the detectable XMRIs orbiting SgrA* will have eccentricities between 0.43 to 0.95 and an SNRin [10,100]. The remaining sim20 % of the sources have an SNRin [100,1000] and eccentricities ranging from 0.25 to 0.92. Additionally, some XMRIs with high SNR are far from being circular. These loud sources with SNRapprox 1000 can have eccentricities as high as eapprox0.7; although their detection chances are low, representing lesssim2 % of the detectable sources, their presence is not ruled out.

  • 3 authors
·
Dec 30, 2024

Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models

Post-training quantization is the leading method for addressing memory-related bottlenecks in LLM inference, but unfortunately, it suffers from significant performance degradation below 4-bit precision. An alternative approach involves training compressed models directly at a low bitwidth (e.g., binary or ternary models). However, the performance, training dynamics, and scaling trends of such models are not yet well understood. To address this issue, we train and openly release the Spectra LLM suite consisting of 54 language models ranging from 99M to 3.9B parameters, trained on 300B tokens. Spectra includes FloatLMs, post-training quantized QuantLMs (3, 4, 6, and 8 bits), and ternary LLMs (TriLMs) - our improved architecture for ternary language modeling, which significantly outperforms previously proposed ternary models of a given size (in bits), matching half-precision models at scale. For example, TriLM 3.9B is (bit-wise) smaller than the half-precision FloatLM 830M, but matches half-precision FloatLM 3.9B in commonsense reasoning and knowledge benchmarks. However, TriLM 3.9B is also as toxic and stereotyping as FloatLM 3.9B, a model six times larger in size. Additionally, TriLM 3.9B lags behind FloatLM in perplexity on validation splits and web-based corpora but performs better on less noisy datasets like Lambada and PennTreeBank. To enhance understanding of low-bitwidth models, we are releasing 500+ intermediate checkpoints of the Spectra suite at https://github.com/NolanoOrg/SpectraSuite{https://github.com/NolanoOrg/SpectraSuite}.

  • 5 authors
·
Jul 17, 2024 3

Binary Latent Diffusion

In this paper, we show that a binary latent space can be explored for compact yet expressive image representations. We model the bi-directional mappings between an image and the corresponding latent binary representation by training an auto-encoder with a Bernoulli encoding distribution. On the one hand, the binary latent space provides a compact discrete image representation of which the distribution can be modeled more efficiently than pixels or continuous latent representations. On the other hand, we now represent each image patch as a binary vector instead of an index of a learned cookbook as in discrete image representations with vector quantization. In this way, we obtain binary latent representations that allow for better image quality and high-resolution image representations without any multi-stage hierarchy in the latent space. In this binary latent space, images can now be generated effectively using a binary latent diffusion model tailored specifically for modeling the prior over the binary image representations. We present both conditional and unconditional image generation experiments with multiple datasets, and show that the proposed method performs comparably to state-of-the-art methods while dramatically improving the sampling efficiency to as few as 16 steps without using any test-time acceleration. The proposed framework can also be seamlessly scaled to 1024 times 1024 high-resolution image generation without resorting to latent hierarchy or multi-stage refinements.

  • 4 authors
·
Apr 10, 2023

Reinforcement Learning for Adaptive Time-Stepping in the Chaotic Gravitational Three-Body Problem

Many problems in astrophysics cover multiple orders of magnitude in spatial and temporal scales. While simulating systems that experience rapid changes in these conditions, it is essential to adapt the (time-) step size to capture the behavior of the system during those rapid changes and use a less accurate time step at other, less demanding, moments. We encounter three problems with traditional methods. Firstly, making such changes requires expert knowledge of the astrophysics as well as of the details of the numerical implementation. Secondly, some parameters that determine the time-step size are fixed throughout the simulation, which means that they do not adapt to the rapidly changing conditions of the problem. Lastly, we would like the choice of time-step size to balance accuracy and computation effort. We address these challenges with Reinforcement Learning by training it to select the time-step size dynamically. We use the integration of a system of three equal-mass bodies that move due to their mutual gravity as an example of its application. With our method, the selected integration parameter adapts to the specific requirements of the problem, both in terms of computation time and accuracy while eliminating the expert knowledge needed to set up these simulations. Our method produces results competitive to existing methods and improve the results found with the most commonly-used values of time-step parameter. This method can be applied to other integrators without further retraining. We show that this extrapolation works for variable time-step integrators but does not perform to the desired accuracy for fixed time-step integrators.

  • 2 authors
·
Feb 18

Exact Learning of Permutations for Nonzero Binary Inputs with Logarithmic Training Size and Quadratic Ensemble Complexity

The ability of an architecture to realize permutations is quite fundamental. For example, Large Language Models need to be able to correctly copy (and perhaps rearrange) parts of the input prompt into the output. Classical universal approximation theorems guarantee the existence of parameter configurations that solve this task but offer no insights into whether gradient-based algorithms can find them. In this paper, we address this gap by focusing on two-layer fully connected feed-forward neural networks and the task of learning permutations on nonzero binary inputs. We show that in the infinite width Neural Tangent Kernel (NTK) regime, an ensemble of such networks independently trained with gradient descent on only the k standard basis vectors out of 2^k - 1 possible inputs successfully learns any fixed permutation of length k with arbitrarily high probability. By analyzing the exact training dynamics, we prove that the network's output converges to a Gaussian process whose mean captures the ground truth permutation via sign-based features. We then demonstrate how averaging these runs (an "ensemble" method) and applying a simple rounding step yields an arbitrarily accurate prediction on any possible input unseen during training. Notably, the number of models needed to achieve exact learning with high probability (which we refer to as ensemble complexity) exhibits a linearithmic dependence on the input size k for a single test input and a quadratic dependence when considering all test inputs simultaneously.

  • 3 authors
·
Feb 23

Differentiability and Optimization of Multiparameter Persistent Homology

Real-valued functions on geometric data -- such as node attributes on a graph -- can be optimized using descriptors from persistent homology, allowing the user to incorporate topological terms in the loss function. When optimizing a single real-valued function (the one-parameter setting), there is a canonical choice of descriptor for persistent homology: the barcode. The operation mapping a real-valued function to its barcode is differentiable almost everywhere, and the convergence of gradient descent for losses using barcodes is relatively well understood. When optimizing a vector-valued function (the multiparameter setting), there is no unique choice of descriptor for multiparameter persistent homology, and many distinct descriptors have been proposed. This calls for the development of a general framework for differentiability and optimization that applies to a wide range of multiparameter homological descriptors. In this article, we develop such a framework and show that it encompasses well-known descriptors of different flavors, such as signed barcodes and the multiparameter persistence landscape. We complement the theory with numerical experiments supporting the idea that optimizing multiparameter homological descriptors can lead to improved performances compared to optimizing one-parameter descriptors, even when using the simplest and most efficiently computable multiparameter descriptors.

Learning Interactions Between Continuous Treatments and Covariates with a Semiparametric Model

Estimating the impact of continuous treatment variables (e.g., dosage amount) on binary outcomes presents significant challenges in modeling and estimation because many existing approaches make strong assumptions that do not hold for certain continuous treatment variables. For instance, traditional logistic regression makes strong linearity assumptions that do not hold for continuous treatment variables like time of initiation. In this work, we propose a semiparametric regression framework that decomposes effects into two interpretable components: a prognostic score that captures baseline outcome risk based on a combination of clinical, genetic, and sociodemographic features, and a treatment-interaction score that flexibly models the optimal treatment level via a nonparametric link function. By connecting these two parametric scores with Nadaraya-Watson regression, our approach is both interpretable and flexible. The potential of our approach is demonstrated through numerical simulations that show empirical estimation convergence. We conclude by applying our approach to a real-world case study using the International Warfarin Pharmacogenomics Consortium (IWPC) dataset to show our approach's clinical utility by deriving personalized warfarin dosing recommendations that integrate both genetic and clinical data, providing insights towards enhancing patient safety and therapeutic efficacy in anticoagulation therapy.

  • 3 authors
·
May 6

European Pulsar Timing Array Limits On An Isotropic Stochastic Gravitational-Wave Background

We present new limits on an isotropic stochastic gravitational-wave background (GWB) using a six pulsar dataset spanning 18 yr of observations from the 2015 European Pulsar Timing Array data release. Performing a Bayesian analysis, we fit simultaneously for the intrinsic noise parameters for each pulsar, along with common correlated signals including clock, and Solar System ephemeris errors, obtaining a robust 95% upper limit on the dimensionless strain amplitude A of the background of A<3.0times 10^{-15} at a reference frequency of 1yr^{-1} and a spectral index of 13/3, corresponding to a background from inspiralling super-massive black hole binaries, constraining the GW energy density to Omega_gw(f)h^2 < 1.1times10^{-9} at 2.8 nHz. We also present limits on the correlated power spectrum at a series of discrete frequencies, and show that our sensitivity to a fiducial isotropic GWB is highest at a frequency of sim 5times10^{-9}~Hz. Finally we discuss the implications of our analysis for the astrophysics of supermassive black hole binaries, and present 95% upper limits on the string tension, Gmu/c^2, characterising a background produced by a cosmic string network for a set of possible scenarios, and for a stochastic relic GWB. For a Nambu-Goto field theory cosmic string network, we set a limit Gmu/c^2<1.3times10^{-7}, identical to that set by the {\it Planck} Collaboration, when combining {\it Planck} and high-ell Cosmic Microwave Background data from other experiments. For a stochastic relic background we set a limit of Omega^relic_gw(f)h^2<1.2 times10^{-9}, a factor of 9 improvement over the most stringent limits previously set by a pulsar timing array.

  • 36 authors
·
Apr 14, 2015

The Two-Pass Softmax Algorithm

The softmax (also called softargmax) function is widely used in machine learning models to normalize real-valued scores into a probability distribution. To avoid floating-point overflow, the softmax function is conventionally implemented in three passes: the first pass to compute the normalization constant, and two other passes to compute outputs from normalized inputs. We analyze two variants of the Three-Pass algorithm and demonstrate that in a well-optimized implementation on HPC-class processors performance of all three passes is limited by memory bandwidth. We then present a novel algorithm for softmax computation in just two passes. The proposed Two-Pass algorithm avoids both numerical overflow and the extra normalization pass by employing an exotic representation for intermediate values, where each value is represented as a pair of floating-point numbers: one representing the "mantissa" and another representing the "exponent". Performance evaluation demonstrates that on out-of-cache inputs on an Intel Skylake-X processor the new Two-Pass algorithm outperforms the traditional Three-Pass algorithm by up to 28% in AVX512 implementation, and by up to 18% in AVX2 implementation. The proposed Two-Pass algorithm also outperforms the traditional Three-Pass algorithm on Intel Broadwell and AMD Zen 2 processors. To foster reproducibility, we released an open-source implementation of the new Two-Pass Softmax algorithm and other experiments in this paper as a part of XNNPACK library at GitHub.com/google/XNNPACK.

  • 2 authors
·
Jan 13, 2020

Improving Post Training Neural Quantization: Layer-wise Calibration and Integer Programming

Lately, post-training quantization methods have gained considerable attention, as they are simple to use, and require only a small unlabeled calibration set. This small dataset cannot be used to fine-tune the model without significant over-fitting. Instead, these methods only use the calibration set to set the activations' dynamic ranges. However, such methods always resulted in significant accuracy degradation, when used below 8-bits (except on small datasets). Here we aim to break the 8-bit barrier. To this end, we minimize the quantization errors of each layer separately by optimizing its parameters over the calibration set. We empirically demonstrate that this approach is: (1) much less susceptible to over-fitting than the standard fine-tuning approaches, and can be used even on a very small calibration set; and (2) more powerful than previous methods, which only set the activations' dynamic ranges. Furthermore, we demonstrate how to optimally allocate the bit-widths for each layer, while constraining accuracy degradation or model compression by proposing a novel integer programming formulation. Finally, we suggest model global statistics tuning, to correct biases introduced during quantization. Together, these methods yield state-of-the-art results for both vision and text models. For instance, on ResNet50, we obtain less than 1\% accuracy degradation --- with 4-bit weights and activations in all layers, but the smallest two. We open-sourced our code.

  • 5 authors
·
Jun 14, 2020

Deep Probability Estimation

Reliable probability estimation is of crucial importance in many real-world applications where there is inherent (aleatoric) uncertainty. Probability-estimation models are trained on observed outcomes (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities of the events of interest are typically unknown. The problem is therefore analogous to binary classification, with the difference that the objective is to estimate probabilities rather than predicting the specific outcome. This work investigates probability estimation from high-dimensional data using deep neural networks. There exist several methods to improve the probabilities generated by these models but they mostly focus on model (epistemic) uncertainty. For problems with inherent uncertainty, it is challenging to evaluate performance without access to ground-truth probabilities. To address this, we build a synthetic dataset to study and compare different computable metrics. We evaluate existing methods on the synthetic data as well as on three real-world probability estimation tasks, all of which involve inherent uncertainty: precipitation forecasting from radar images, predicting cancer patient survival from histopathology images, and predicting car crashes from dashcam videos. We also give a theoretical analysis of a model for high-dimensional probability estimation which reproduces several of the phenomena evinced in our experiments. Finally, we propose a new method for probability estimation using neural networks, which modifies the training process to promote output probabilities that are consistent with empirical probabilities computed from the data. The method outperforms existing approaches on most metrics on the simulated as well as real-world data.

  • 11 authors
·
Nov 20, 2021

MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards

The rapid scaling of large language models necessitates more lightweight finetuning methods to reduce the explosive GPU memory overhead when numerous customized models are served simultaneously. Targeting more parameter-efficient low-rank adaptation (LoRA), parameter sharing presents a promising solution. Empirically, our research into high-level sharing principles highlights the indispensable role of differentiation in reversing the detrimental effects of pure sharing. Guided by this finding, we propose Mixture of Shards (MoS), incorporating both inter-layer and intra-layer sharing schemes, and integrating four nearly cost-free differentiation strategies, namely subset selection, pair dissociation, vector sharding, and shard privatization. Briefly, it selects a designated number of shards from global pools with a Mixture-of-Experts (MoE)-like routing mechanism before sequentially concatenating them to low-rank matrices. Hence, it retains all the advantages of LoRA while offering enhanced parameter efficiency, and effectively circumvents the drawbacks of peer parameter-sharing methods. Our empirical experiments demonstrate approximately 8x parameter savings in a standard LoRA setting. The ablation study confirms the significance of each component. Our insights into parameter sharing and MoS method may illuminate future developments of more parameter-efficient finetuning methods.

  • 8 authors
·
Oct 1, 2024

Planck 2018 results. VI. Cosmological parameters

We present cosmological parameter results from the final full-mission Planck measurements of the CMB anisotropies. We find good consistency with the standard spatially-flat 6-parameter LambdaCDM cosmology having a power-law spectrum of adiabatic scalar perturbations (denoted "base LambdaCDM" in this paper), from polarization, temperature, and lensing, separately and in combination. A combined analysis gives dark matter density Omega_c h^2 = 0.120pm 0.001, baryon density Omega_b h^2 = 0.0224pm 0.0001, scalar spectral index n_s = 0.965pm 0.004, and optical depth tau = 0.054pm 0.007 (in this abstract we quote 68,% confidence regions on measured parameters and 95,% on upper limits). The angular acoustic scale is measured to 0.03,% precision, with 100theta_*=1.0411pm 0.0003. These results are only weakly dependent on the cosmological model and remain stable, with somewhat increased errors, in many commonly considered extensions. Assuming the base-LambdaCDM cosmology, the inferred late-Universe parameters are: Hubble constant H_0 = (67.4pm 0.5)km/s/Mpc; matter density parameter Omega_m = 0.315pm 0.007; and matter fluctuation amplitude sigma_8 = 0.811pm 0.006. We find no compelling evidence for extensions to the base-LambdaCDM model. Combining with BAO we constrain the effective extra relativistic degrees of freedom to be N_{rm eff} = 2.99pm 0.17, and the neutrino mass is tightly constrained to sum m_nu< 0.12eV. The CMB spectra continue to prefer higher lensing amplitudes than predicted in base -LambdaCDM at over 2,sigma, which pulls some parameters that affect the lensing amplitude away from the base-LambdaCDM model; however, this is not supported by the lensing reconstruction or (in models that also change the background geometry) BAO data. (Abridged)

  • 182 authors
·
Jul 17, 2018

Sample complexity of data-driven tuning of model hyperparameters in neural networks with structured parameter-dependent dual function

Modern machine learning algorithms, especially deep learning based techniques, typically involve careful hyperparameter tuning to achieve the best performance. Despite the surge of intense interest in practical techniques like Bayesian optimization and random search based approaches to automating this laborious and compute intensive task, the fundamental learning theoretic complexity of tuning hyperparameters for deep neural networks is poorly understood. Inspired by this glaring gap, we initiate the formal study of hyperparameter tuning complexity in deep learning through a recently introduced data driven setting. We assume that we have a series of deep learning tasks, and we have to tune hyperparameters to do well on average over the distribution of tasks. A major difficulty is that the utility function as a function of the hyperparameter is very volatile and furthermore, it is given implicitly by an optimization problem over the model parameters. To tackle this challenge, we introduce a new technique to characterize the discontinuities and oscillations of the utility function on any fixed problem instance as we vary the hyperparameter; our analysis relies on subtle concepts including tools from differential/algebraic geometry and constrained optimization. This can be used to show that the learning theoretic complexity of the corresponding family of utility functions is bounded. We instantiate our results and provide sample complexity bounds for concrete applications tuning a hyperparameter that interpolates neural activation functions and setting the kernel parameter in graph neural networks.

  • 3 authors
·
Jan 23