Running Large Transformer Models on Mobile and Edge Devices
Introduction
Running artificial intelligence models on mobile devices and in edge environments has become an increasingly important topic in recent years. In particular, running large Transformer-based language models directly on the device instead of in the cloud offers significant advantages in privacy and latency: user data never leaves the device, and real-time responses become possible because network round-trips are eliminated. However, the hundreds of millions (or even billions) of parameters and heavy computational requirements of these models make it challenging to run them efficiently under the processor and memory constraints of mobile devices.
In this article, we will cover the model optimization techniques and tools necessary to run large Transformer models on mobile and edge devices. Within the Hugging Face ecosystem, we will examine ways to make models suitable for mobile (such as quantization, knowledge distillation, and model pruning) and detail deployment methods using technologies and tools like ONNX, Core ML, and Hugging Face Optimum. We will include technical explanations and code snippets to guide mobile developers and AI engineers, along with real-world examples.
Why Run Models On-Device?
Running models directly on the device instead of on cloud infrastructure has several advantages:
- Privacy: User data does not leave the device, allowing sensitive information to be processed without being sent to the cloud. For example, Apple has emphasized that running large language models on-device protects user privacy. This approach is critical for ensuring data privacy in fields like health or finance.
- Low Latency: Model outputs can be received instantly without needing an internet connection. Applications can respond in real-time as there is no network traffic or server latency. This is especially important for time-critical applications such as augmented reality, voice assistants, or autonomous devices.
- Offline Usage: Mobile applications can continue to work even without an internet connection. For example, a translation app can perform offline translations by storing a pre-trained Transformer model on the device.
- Cost and Scalability: Continuously running models on cloud servers can be costly. For applications with millions of users, running the model on the user's device instead of making separate cloud requests for each client reduces server costs. Furthermore, powerful processors on devices (like the Apple Neural Engine, Qualcomm Hexagon DSP) often have idle capacity—on-device AI efficiently uses this existing hardware.
Of course, running large models on-device involves some challenges. Obstacles include high memory requirements, limited battery life, the limited computational power of mobile processors, and thermal constraints. Below, we will discuss the model compression and acceleration techniques commonly used to overcome these challenges.
Methods for Shrinking and Accelerating Large Models
To make a large machine learning model efficient in a mobile environment, it is necessary to reduce its size and computational load. Among the most popular methods are quantization, knowledge distillation, and model pruning. These techniques can often be used together to achieve significant model reduction. Below, we detail each method under separate headings.
Quantization (Reducing Arithmetic Precision)
Quantization is the process of reducing the numerical representation of model weights and activations. For example, common approaches include using 16-bit floating-point (FP16) or 8-bit integer (INT8) instead of 32-bit floating-point (FP32). This dramatically reduces model size and memory usage (switching from FP32 to INT8 provides about a 4x size reduction). At the same time, low-bit computations can run faster with appropriate hardware support (e.g., vector operations, GPU or NPU accelerators).
Quantization can be static or dynamic. In dynamic quantization, the weights are converted to an integer format ahead of time, while activation scales are computed on the fly at runtime (some activations may remain in FP32). This approach does not require additional training and is commonly used to accelerate specific layers in Transformer models (e.g., matrix multiplications). In static quantization (with quantization-aware training or post-training calibration), the model's activation ranges are measured on a calibration dataset, and both weights and activations are converted to a low-bit representation in advance. Static quantization generally preserves accuracy better but is slightly more involved, since it requires calibrating the model.
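To make the distinction concrete, the snippet below is a minimal sketch of dynamic quantization in plain PyTorch, outside the Optimum/ONNX workflow discussed next (the model checkpoint is just an example):

import torch
from transformers import AutoModelForSequenceClassification

# Load an example Transformers model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
# Convert the weights of all Linear layers to INT8; activations stay in FP32
# and are quantized on the fly inside the matrix multiplications
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

The resulting object is a drop-in replacement for CPU inference; ONNX Runtime applies the same idea below, only on the ONNX graph instead of the PyTorch module.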
In the Hugging Face environment, the Optimum library provides easy-to-use quantization tools with ONNX Runtime support. For example, exporting a Transformers model to ONNX format and applying 8-bit dynamic quantization can be done with just a few lines of code. The following code demonstrates using Hugging Face Optimum to convert a DistilBERT model to ONNX format and dynamically quantize it for the AVX512-VNNI instruction set:
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig
# Load a PyTorch Transformers model from HF Hub in ONNX format
onnx_model = ORTModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased-finetuned-sst-2-english", export=True
)
# Create the quantizer object
quantizer = ORTQuantizer.from_pretrained(onnx_model)
# Define the quantization configuration (e.g., for AVX512_VNNI, dynamic and not per-channel)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# Quantize the model and save the output
quantizer.quantize(save_dir="quantized-model/", quantization_config=qconfig)
As a result of the above process, all supported layers of the model are represented as 8-bit integers in ONNX. The Optimum library abstracts ONNX Runtime’s quantization tools, simplifying this process. It's also possible to do the same with a one-line command:
optimum-cli onnxruntime quantize --onnx_model onnx_model_folder/ --arm64 -o onnx_quantized/
The --arm64 flag in the command selects a quantization mode suitable for ARM-based mobile processors (e.g., 8-bit based on SIMD). The Optimum CLI can automatically determine the best quantization settings for various hardware targets like ARM64, AVX2, and AVX512. The quantized model will be much smaller than the original and provides acceleration on supported hardware. For example, 8-bit quantization of a BERT model reduces the model file to one-fourth its size, bringing both storage and memory advantages.
Performance: The performance gain from quantization depends on the hardware used. Many mobile CPUs have vector units optimized for 8-bit integer matrix multiplications. Apple Neural Engine or Android NNAPI-supported DSPs/NPUs also offer dramatic speed-ups for 8-bit operations. ONNX Runtime can integrate with NNAPI or XNNPACK on Android and Core ML on iOS, allowing quantized models to benefit from this hardware. If the hardware does not directly support 8-bit operations, the quantized model will still run, but the gain may be limited. Additionally, low bit depth can lead to some accuracy loss, especially in very deep models. Therefore, for critical applications, mixed-precision solutions (e.g., some layers 8-bit, others 16-bit) or accuracy improvements via post-quantization fine-tuning are used.
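As noted above, static quantization additionally needs a calibration pass. The sketch below follows the Optimum ONNX Runtime static-quantization workflow; the calibration dataset (a small SST-2 sample) and the number of samples are illustrative choices:

from functools import partial
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
quantizer = ORTQuantizer.from_pretrained(onnx_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Static quantization configuration for ARM64 mobile CPUs
qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True, max_length=128)

# Collect a small calibration set and measure activation ranges
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=50,
    dataset_split="train",
)
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
)
# Quantize weights and activations using the measured ranges
quantizer.quantize(
    save_dir="quantized-static/",
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
)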
Knowledge Distillation
Knowledge distillation is a technique that aims to transfer the knowledge from a large, high-capacity teacher model to a smaller, faster student model. In this method, the student model is trained to mimic the outputs or intermediate representations of the teacher. As a result, the student model, despite having fewer parameters and less depth, can achieve performance close to the teacher.
Distillation has been applied very successfully to Transformer models in particular. For example, DistilBERT, created by the Hugging Face team, is about 40% smaller and 60% faster than BERT-base while retaining roughly 97% of its language-understanding performance. DistilBERT is a ~66 million parameter version of BERT-base, trained with a triple loss (language modeling, distillation, and cosine-embedding losses), and contains 6 Transformer layers instead of BERT's 12, roughly halving both memory and compute load.
Similarly, models like TinyBERT and MobileBERT were also distilled from a BERT teacher. MobileBERT is a model optimized for mobile devices, created by Google by restructuring BERT (adding bottleneck layers and carefully balanced linear transformations); it has about 25 million parameters and has been shown to run several times faster than BERT-base on Pixel phones. Another example, SqueezeBERT, shrinks BERT by replacing many of its fully connected layers with grouped convolutions borrowed from computer vision; it was reported to run 4.3 times faster than BERT-base on a Pixel 3 phone.
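As a quick sanity check of these size claims, the parameter counts can be compared directly with Transformers (this downloads both checkpoints):

from transformers import AutoModel

for name in ["bert-base-uncased", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")  # ~110M vs ~66M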
The basic idea in distillation is that the teacher model's soft logits or intermediate layer outputs guide the student. The student model tries to learn the probability distributions predicted by the teacher and, if available, its attention maps. This allows it to learn how the teacher spreads its knowledge, rather than just learning the correct answer. Ultimately, the student model achieves higher accuracy than an equivalent model trained from scratch on the same data. For example, if a BERT-Base model gets 90% accuracy, a small model trained on the same data might only achieve 85%; but with distillation, this small model can approach 88-89% accuracy.
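A minimal sketch of this loss, assuming the teacher and student logits are already computed (the temperature T and mixing weight alpha are hypothetical hyperparameters):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between the softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

In practice the DistilBERT recipe adds further terms (such as a cosine loss on hidden states), but the soft-target term above is the core of knowledge distillation.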
Advantages: Models to which knowledge distillation has been applied are generally both smaller in size and faster. This is critically important for real-time applications on mobile devices. Another benefit of distillation is that if the teacher model has been pre-trained on a large amount of data, the student model can achieve reasonable generalization with less data—it's as if it benefits from the teacher's experience. Indeed, a teacher can be transferred to multiple students, thus creating model families of different sizes. NVIDIA researchers reported that when they reduced the Llama 3.1 (8B parameters) model to 4B and 2.7B parameter Minitron models using distillation and pruning, these smaller models performed close to the original but required much less computation. They also stated that distillation reduced the amount of data needed to train the 8B model by 40-fold.
What to watch out for: When applying distillation, the student model's architecture is usually predefined (e.g., half the number of layers, narrower hidden dimensions, etc.). If the student is chosen to be too small, there is a risk of it being unable to fully absorb the teacher's knowledge—this is called a capacity gap in the literature (if the student's capacity is insufficient to represent the teacher, performance drops). Therefore, in practice, the student's size must be balanced with the target task's requirements. Also, the distillation training process can be more complex than standard training, as it involves optimizing multiple loss terms simultaneously. Nevertheless, modern frameworks (e.g., distillation support in the Hugging Face Trainer API or the training scripts provided by Tencent for TinyBERT) are making this process easier. In conclusion, distilled models offer solutions for mobile applications that are almost as good as the teacher, but much more efficient.
Model Pruning
Model pruning is the process of making a model smaller and faster by removing (zeroing out) certain parts of it. The pruning process can be unstructured or structured.
Unstructured pruning: This means setting the least important weights in a weight matrix to zero. For example, setting 20% of a layer's weights to zero makes the parameters sparse. This approach does not directly reduce the model's file size (zeros still take up space), but it can bring speed-ups if the runtime can skip the computation for these zeros. Unfortunately, most standard mobile inference kernels are not efficient at skipping scattered zeros, so the speed gain from purely unstructured pruning on mobile can be limited. Still, unstructured pruning can reduce memory-bandwidth pressure and serve as a preliminary optimization step.
Structured pruning: This involves the complete removal of certain elements from the model's structure. For example, completely pruning some attention heads in a Transformer model or removing some layers is structured pruning. This way, the pruned parts are not computed at all, and the model's size effectively shrinks. For example, it is possible to get a model with fewer heads by pruning the least significant attention heads from each layer in a BERT model. Structured pruning is a more advanced optimization and usually requires the model to be retrained (fine-tuned) after pruning, as suddenly deleting some neurons can cause a significant performance drop. Retraining allows the remaining parts of the model to recover and performance to increase.
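Transformers exposes a simple API for this kind of structured pruning of attention heads; the sketch below removes a few heads from a BERT model (which heads to remove is illustrative, in practice it would be decided by an importance analysis):

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
# Map of layer index -> list of head indices to remove
heads_to_prune = {0: [0, 1], 1: [2, 3]}
model.prune_heads(heads_to_prune)
# The corresponding query/key/value projection slices are physically removed,
# so the saved model is genuinely smaller and performs fewer operations.
model.save_pretrained("bert-pruned-heads/")

After such surgery, a short fine-tuning run is usually needed to recover accuracy, as described above.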
Current research has shown that combining pruning with distillation is very effective. In NVIDIA's aforementioned Minitron work, the Llama 8B model was first pruned along depth and width down to 4B parameters, and its performance was then recovered with distillation. The pruned-and-distilled model achieved an MMLU score about 16% higher than a same-size model trained from scratch. Hugging Face has similarly reported speed-ups from block pruning of Transformer encoder models (structured pruning of attention heads and feed-forward blocks). Another use of pruning is to provide flexibility under limited resource budgets: for instance, you can derive different-sized versions of a model (e.g., 100 million, 50 million, and 10 million parameter variants) from the same base model through pruning and choose the appropriate one based on the device's capacity.
The Hugging Face Optimum-Intel integration offers integrated pruning and quantization with tools like Intel Neural Compressor (INC) and Neural Network Compression Framework (NNCF). For example, the INCTrainer structure within Optimum can automatically prune model weights at certain steps during training. An example configuration is shown below:
from optimum.intel.neural_compressor import INCTrainer
from neural_compressor import WeightPruningConfig
from transformers import TrainingArguments
pruning_config = WeightPruningConfig(
pruning_type="magnitude",
target_sparsity=0.2,
pruning_scope="local",
# ...start/end steps, etc.
)
trainer = INCTrainer(
model=model,
args=TrainingArguments(...),
train_dataset=...,
pruning_config=pruning_config,
# ... other Trainer parameters
)
trainer.train()
trainer.save_model("pruned_model")
A configuration like the one above zeroes out the smallest-magnitude weights at a given rate as training progresses. Note, however, that this magnitude-based pruning only makes 20% of the weights zero in a "local", per-layer fashion; it does not by itself reduce the model size. For the file to truly shrink, the zeroed weights must actually be removed (e.g., by deleting whole rows or columns) once enough of them have been pruned. Structured pruning of this kind is possible with Optimum-Intel and NNCF, and it can genuinely produce smaller model files.
Key Points: When applying pruning, one must be careful that overly aggressive pruning can reduce model capacity and lead to a drop in accuracy. Generally, a pruning rate of 10-30% (target sparsity) provides speed gains with a small performance penalty. High rates of pruning (e.g., 80-90%) are only feasible if compensated for with distillation or special retraining techniques. Furthermore, real gains are achieved if the infrastructure used to run the pruned models can accelerate sparse matrix operations (e.g., NVIDIA cuSparse or sparse libraries on CPU). Common inference libraries on mobile devices are not yet very efficient at sparse matrix multiplications, so in practical mobile scenarios, structured pruning should be preferred. For example, pruning an entire attention head or removing an entire layer both reduces the model size and directly means fewer operations on the mobile device.
In summary, by using quantization, distillation, and pruning together, the size and runtime cost of an original Transformer model can be reduced very significantly. Take BERT-base as an example: 110 million parameters, ~440 MB of memory with FP32 weights, and perhaps only a few sentences per second on a mobile CPU. Suppose we shrink it to 66M parameters with distillation (DistilBERT), switch to 8-bit weights with quantization (~66 MB), and prune some attention heads, leaving about 60M effective parameters. The resulting model would be well under 100 MB and, with an optimized mobile runtime, could comfortably provide near-real-time responses on the device. Indeed, models like TinyBERT and MiniLM in the Hugging Face Transformers library were trained with such techniques and have become popular for mobile use.
Model Conversion and Runtime Methods for Mobile Devices
After shrinking our model with optimization techniques or choosing a small model from the start, the issue of running (inference) it on a mobile platform arises. There are different model formats and runtime options for different mobile platforms (Android, iOS, IoT devices, etc.). In this section, we will describe common approaches like ONNX, Core ML, and TensorFlow Lite, and ways to convert to these formats using the Hugging Face toolchain. We will also discuss the hardware compatibility and performance characteristics of these methods.
Platform-Independent Deployment with ONNX and ONNX Runtime
ONNX (Open Neural Network Exchange) is a widely supported intermediate format that enables model exchange between different machine learning frameworks. Even if Transformer models are trained in PyTorch or TensorFlow, they can be exported to the ONNX format and run in various environments, including mobile devices. The advantage of ONNX is that it has standardized definitions for each operation. This means when you convert your model to ONNX, any infrastructure that supports it can interpret that model. For example, it's possible to convert a model trained with PyTorch to ONNX and use it with TensorFlow.
The Hugging Face Transformers library has a built-in tool for converting models to ONNX. You can export a popular model with a simple command. For example:
pip install transformers[onnx] # Required dependencies for the ONNX export module
python -m transformers.onnx --model=distilbert-base-uncased --feature=sequence-classification onnx/
The command above saves the DistilBERT sequence-classification model in ONNX format as onnx/model.onnx. During the export, the Transformers library uses the ONNX config class defined for the respective architecture and also verifies that the exported graph produces the expected outputs. Many architectures are supported (BERT, RoBERTa, GPT-2, T5, MobileBERT, SqueezeBERT, etc. have ready-made ONNX configs). In more recent releases this export path has been superseded by optimum-cli export onnx, which works the same way.
The most common option for running a model converted to ONNX format is the ONNX Runtime (ORT) library. Developed by Microsoft, ORT is used to efficiently run ONNX models in both server and mobile environments. ORT offers a modular structure called Execution Providers to use different accelerators on different devices. The important ones for mobile are:
- CPU (Default): A mode that runs on almost any device, using optimized libraries on the pure CPU (e.g., XNNPACK). It's a reliable choice for small models and in cases of thermal issues.
- NNAPI (Android Neural Networks API): An interface provided by Android to use DSP, GPU, or NPU hardware on Android devices. ONNX Runtime can enable NNAPI mode to run supported operations on the phone's special accelerators. This can provide huge speed gains, especially with quantized models (e.g., Qualcomm Hexagon DSP performs int8 matrix operations much faster than the CPU).
- Core ML (iOS): For iPhone/iPad devices, ONNX Runtime can directly use the Core ML infrastructure. This allows units like the Apple Neural Engine to be engaged.
- XNNPACK: This is an optimized CPU computation library available on both Android and iOS. It is particularly fast for float32 and float16 operations on mobile CPUs. ORT can automatically use XNNPACK for non-quantized models.
Integrating an ONNX model into a mobile application requires adding the respective platform's ORT package to the project and making a few API calls. ORT provides ready-made libraries for Java, C++, and Swift/Obj-C. For example, on Android, after adding the .aar package, you can create an OrtEnvironment with the Java interface, load the model via an InferenceSession, and run it. The relevant code is similar to using PyTorch or TensorFlow, just the framework is different. It's also possible to pull the ONNX model provided by Hugging Face directly from the HF Hub into the application; however, to control the application size, it's usually preferred to distribute the model file with the application.
Performance and Size: ONNX Runtime's mobile distribution is quite comprehensive by default and takes up ~10-15 MB of space (as it includes various operators). However, this size can be reduced by removing unnecessary parts (minimal build). For example, it's possible to reduce the ORT library to 4-5 MB by compiling only the operations used by a specific model. The size of the ONNX model file is similar to the original PyTorch/TensorFlow weight file size; it will be smaller if you have quantized it. Combining the optimization techniques we mentioned above with ONNX is very effective: you can get maximum efficiency for both the model and the hardware by running a distilled and quantized model with ONNX Runtime + NNAPI.
Let's give an example scenario: Suppose a user will classify a short text in the application. You've converted the DistilBERT model to ONNX and quantized it to 8-bit. Your Android application loads this ONNX model using ORT. When the application receives a text input, it first tokenizes the sentence using the Hugging Face Tokenizer (e.g., AutoTokenizer) and converts it to numpy arrays, then feeds it to ORT. ORT, in the background, uses NNAPI if available to leverage accelerators other than the CPU. As a result, it gets the classification result in milliseconds and displays it to the user. This pipeline is much faster and more private than sending it to the cloud. There's no big difference from using PyTorch in terms of code, other than using ORT. Indeed, the Hugging Face documentation states: "Once the model is exported to the ONNX format, it is easy to run it on many accelerators – for example, we can load and run it with ONNX Runtime". In other words, switching to the ONNX format is a key to flexibility and speed on mobile.
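Before wiring this into the Android app, the same flow can be prototyped on a desktop with the onnxruntime Python package; the model path below is illustrative and depends on how the quantized model was saved:

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
session = ort.InferenceSession("quantized-model/model_quantized.onnx")  # illustrative path

# Tokenize, run the ONNX graph, and read the classification logits
inputs = tokenizer("This movie was great!", return_tensors="np")
outputs = session.run(None, {k: v.astype(np.int64) for k, v in inputs.items()})
logits = outputs[0]
print("Predicted class:", int(np.argmax(logits, axis=-1)[0]))

On Android, the equivalent calls go through the ORT Java API (OrtEnvironment, OrtSession), with NNAPI enabled via the session options.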
Transformer Models on iOS Devices with Core ML
Core ML is the framework used to run machine learning models in the Apple ecosystem (iOS, macOS, watchOS, tvOS). It can run models on Apple silicon (A-series chips, M-series Mac chips) with very high performance and low power consumption. Like ONNX, Core ML accepts models from different frameworks, converting them into its own format, .mlmodel (or the newer .mlpackage). Apple states that models run with Core ML are automatically scheduled across the Apple Neural Engine, GPU, and CPU in the most efficient way, so the developer does not need to worry about low-level hardware details.
To convert a PyTorch or TensorFlow model to Core ML format, Apple's Core ML Tools (coremltools) Python package is used. The conversion is generally a two-step process:
- Converting the model to TorchScript (for PyTorch): PyTorch models are first converted to TorchScript (a .pt file). TorchScript is a serializable representation of the model, independent of the Python code. It can be produced in two ways: tracing (deriving the computation graph from a specific example input) or scripting (compiling the model code at the Python AST level). Tracing produces fast, optimized graphs, but scripting may be necessary for models with dynamic control flow. Transformer models, which have a fixed set of layers, can usually be converted successfully with tracing.
- Converting the TorchScript model to Core ML format: The coremltools package converts a TorchScript (or SavedModel, ONNX, or Keras) model to a .mlmodel file using the ct.convert() function. In this step, the model's input/output formats and, preferably, the target iOS version are specified. For example:

import coremltools as ct
import torch

traced_model = torch.jit.trace(my_model, example_input)
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=input_shape)]
)
mlmodel.save("model.mlmodel")

The code above traces the PyTorch model and saves the Core ML model. Here, we fix the input tensor's dimensions with input_shape; Core ML generally works with fixed-size tensors. Flexible dimensions can be specified if necessary, but fixing them is recommended for performance.
There are a few points to be aware of when converting large Transformer-based models to Core ML format:
- Memory size: Core ML Tools saves model weights as FP16 (Float16) by default. This halves the model size (16 bits instead of 32). For example, Apple's own demo reports that the Llama 3.1 8B model converted to Core ML takes up ~16 GB (8B parameters * 2 bytes). This is still quite large, so in practice more aggressive quantization such as 8-bit or 4-bit comes into play (e.g., quantizing with the GPTQ algorithm and then converting to Core ML); a weight-compression sketch follows this list. Apple notes that Core ML also supports quantization-aware conversion, but for large models the best results in their demo were obtained with 16-bit weights running on the GPU.
- Key-value cache (KV cache): In generative language models (like GPT-2 or Llama), per-head caches (past key values) are used to speed up subsequent token generation. Supporting this mechanism in Core ML can be difficult because it requires embedding a recurrent structure within the model. In Apple's Llama example, a simpler model was exported in the first stage by disabling the KV cache (use_cache=False); ways to manage the cache are discussed in an optimized version. In other words, if you are converting a language model, exporting a model that processes a whole sequence at once is easier than exporting a structure that generates tokens one step at a time.
- Core ML Tools version compatibility: It is important to use compatible versions of PyTorch and coremltools. For example, as of 2023, coremltools 6+ integrates better with PyTorch 2 and supports most Transformer ops. Apple and Hugging Face have also published Swift packages for running Transformer models directly with Core ML (such as Core ML Stable Diffusion and swift-transformers), which shows how far the Core ML side has come.
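To illustrate the memory-size point above, the sketch below (assuming coremltools 7+) converts a traced model to an ML Program, whose weights are stored as FP16 by default, and then compresses the weights further to INT8 with the coremltools optimize APIs. A tiny stand-in module is used so the example is self-contained; in practice this would be the traced Transformer from the previous step:

import numpy as np
import torch
import coremltools as ct
import coremltools.optimize.coreml as cto

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 128)
    def forward(self, x):
        return self.linear(x)

traced = torch.jit.trace(TinyModel().eval(), torch.randn(1, 128))
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=(1, 128))],
    convert_to="mlprogram",  # ML Program format stores weights as FP16 by default
)

# Post-training 8-bit weight quantization (roughly another 2x size reduction)
op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric", dtype=np.int8)
config = cto.OptimizationConfig(global_config=op_config)
mlmodel_int8 = cto.linear_quantize_weights(mlmodel, config=config)
mlmodel_int8.save("model_int8.mlpackage")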
Performance: On Apple devices, models running with Core ML get hardware acceleration from units like the Neural Engine and GPU. According to figures published by Apple, the 8-billion-parameter Llama 3.1 model reached a generation speed of ~33 tokens/sec on an M1 Max chip. This is a very impressive value and can offer a fluent experience for an on-device LLM in daily use. Of course, this performance was achieved in a scenario where the model was running at 16-bit and using the GPU. It is possible to run a smaller model (e.g., 1-2B parameters, 4-bit) at reasonable speeds on an iPhone as well—for example, it has been reported by the community that a 70M parameter chat model gives near-real-time responses on a 2024 model iPhone. These results show the power of specialized hardware in mobile devices. Core ML offers an ideal way to use this power. Hugging Face is collaborating with Apple to publish various Core ML-compatible models (for example, models like Stable Diffusion Core ML, MobileNet Core ML are listed under the apple organization on the Hugging Face Hub).
Real-World Example: It's known that Apple uses a tiny Transformer model in its own Mail application for spam detection. This model is embedded in the application in Core ML format and determines on-device whether incoming emails are spam. Similarly, mobile keyboard applications like SwiftKey and Gboard have started running small versions of Transformer-based language models (like DistilGPT-2 style or custom small LMs) on-device for text prediction. All these applications compile and use the model with Core ML or similar frameworks. From a developer's perspective, when you add a .mlmodel file to your Xcode project, a Swift/Obj-C class is automatically generated, and you can easily provide input and get output from the model through this class. This has made machine learning in the Apple ecosystem very accessible.
TensorFlow Lite and Other Approaches
Another popular method in the Android ecosystem is using TensorFlow Lite (TFLite). TFLite is a solution optimized for running TensorFlow models on mobile and embedded devices, converting models into a special binary format with the .tflite extension. The advantage of TFLite is its broad hardware support developed by Google and its small runtime size (around ~1-2MB). On many Android devices, TFLite can perform acceleration with NNAPI or directly with its own GPU Delegate mechanism. TFLite is frequently used, especially for image processing and basic NLP tasks.
The Hugging Face Optimum library also supports converting Transformers models to TFLite format. You can first convert the PyTorch model to TF and then create a .tflite file with the optimum-cli export tflite command. For example, to convert the BERT-base model:
pip install optimum[exporters-tf] # Required plugin for TFLite export
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/
This command takes the official BERT-base model from the Hugging Face Hub, fixes the input sequence length to 128, and produces a TFLite model; the export log confirms that it was saved successfully. During the conversion, TFLite's post-training quantization features can also be used if desired, for example reducing the weights to FP16 with a parameter like --quantize float16 (Optimum also exposes TFLite quantization configurations).
Using TFLite, similar to ONNX, relies on adding a library to the application and feeding input to the model. On Android, the TensorFlow Lite Interpreter class is generally used; you load the .tflite file, feed the data as a ByteBuffer, and get the output. Hugging Face has also started hosting models in TFLite format on the Hub. For example, a .tflite quantized version of a small sentence transformer model based on MiniLM is available. This allows developers to directly download the TFLite model and integrate it into their applications.
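Before integrating on Android, the exported .tflite file can be smoke-tested with TensorFlow's Python interpreter; the file name below is illustrative and depends on the export output:

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
interpreter = tf.lite.Interpreter(model_path="bert_tflite/model.tflite")  # illustrative path
interpreter.allocate_tensors()

inputs = tokenizer(
    "On-device inference keeps data private.",
    padding="max_length", truncation=True, max_length=128, return_tensors="np",
)
# Match each TFLite input tensor to the corresponding tokenizer output by name
for detail in interpreter.get_input_details():
    for key in ("input_ids", "attention_mask", "token_type_ids"):
        if key in detail["name"]:
            interpreter.set_tensor(detail["index"], inputs[key].astype(detail["dtype"]))

interpreter.invoke()
output = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
print(output.shape)

On Android, the same steps are performed with the Interpreter class of the TensorFlow Lite Java/Kotlin API.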
Apart from this, for those who prefer to stay with PyTorch, there is the PyTorch Mobile (TorchScript) approach. PyTorch lets you convert your model to TorchScript and run it on Android/iOS through a C++ runtime. However, since the PyTorch Mobile runtime is relatively large (~5 MB+) and its GPU support is limited (prototype-level Vulkan on Android and Metal on iOS), ONNX Runtime or TFLite is generally preferred for performance-critical tasks. Still, teams already invested in the PyTorch ecosystem that need to port a model quickly can take the TorchScript path. Hugging Face Transformers models can be loaded in a TorchScript-compatible way with the .from_pretrained(..., torchscript=True) parameter; after tracing, calling traced_model.save("model.pt") produces the .pt file to be used on mobile.
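A minimal sketch of that TorchScript route, including the mobile-specific optimization pass (the checkpoint is just an example; the plain traced_model.save("model.pt") mentioned above also works):

import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Trace with a representative input
example = tokenizer("hello world", return_tensors="pt")
traced = torch.jit.trace(model, (example["input_ids"], example["attention_mask"]))

# Apply mobile-oriented graph optimizations and save for the lite interpreter
optimized = optimize_for_mobile(traced)
optimized._save_for_lite_interpreter("model.ptl")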
In summary, there are multiple options for running models on mobile devices, and the choice should be made based on the project's needs:
- ONNX Runtime: Framework-independent, flexible, and high-performance. A slightly larger binary size, but covers both iOS and Android with a single solution.
- Core ML: Gives the best performance for iOS/macOS, specific to the Apple ecosystem.
- TensorFlow Lite: Popular for Android (and iOS), a very small and fast runtime, but generally easier to integrate with TF-based models.
- PyTorch Mobile: Offers the ability to easily port an existing PyTorch model, but its maturity on mobile is not as high as the others.
As can be seen, Hugging Face Optimum can offer conversion to formats like ONNX and TFLite and apply quantization in these formats under a single umbrella. This way, a developer wanting to run a Transformer model on mobile can achieve their goal by converting it to the appropriate format with Optimum and then using the relevant mobile library.
Optimization and Integration with Hugging Face Optimum
Hugging Face Optimum, which we've mentioned in various contexts above, is a toolkit that simplifies model optimization. To summarize, Optimum provides:
- Hardware Accelerator Integrations: Optimum connects Transformers to libraries specific to different hardware and software ecosystems. For example, the optimum.onnxruntime module targets ONNX Runtime, optimum.intel targets Intel's open-source tools (OpenVINO, Intel Neural Compressor, etc.), optimum.habana targets Intel's Habana Gaudi accelerators, and optimum.graphcore targets Graphcore IPUs. For our topic, the optimum.onnxruntime and optimum.exporters modules are the most relevant.
- Easy Quantization Interface: Optimum exposes ONNX Runtime's quantization tooling as a simple Python API or CLI. Taking a Transformers model (e.g., an .onnx model from a folder or directly from the HF Hub) and reducing it to 8-bit with the ORTQuantizer class is very practical; we showed this usage in the code example above. Supported quantization modes include dynamic, static (with calibration), and modes tuned for specific processor instruction sets. For example, you can select the best settings for ARM-based mobile CPUs by calling AutoQuantizationConfig.arm64(). Optimum has also started to integrate newer quantization techniques such as GPTQ (4-bit LLM quantization).
- Compilation and Conversion Tools: Optimum handles conversion in a single step with the optimum-cli export onnx and optimum-cli export tflite commands. Under the hood these commands run the Transformers or ONNX converters, but Optimum makes it easy to set the parameters correctly and prevents mismatches. For example, when converting to TFLite, if no TensorFlow version of the model exists, Optimum first creates one automatically and then calls the TFLite converter.
- Intel Neural Compressor and NNCF Integration: For more advanced optimization (such as combined quantization + pruning + distillation during training), the Optimum-Intel module makes this possible directly through the Trainer API. For example, applying quantization-aware training (QAT) or structured pruning with a specific sparsity target during training is supported. These features are aimed more at model developers, as tools integrated into the training phase, but the results are highly efficient models that run well on mobile. For example, using Optimum-Intel, 90% pruning combined with distillation has been applied to BERT models, producing models that run about 10 times faster while retaining around 99% of the original accuracy (such studies exist in the literature).
In short, Hugging Face Optimum is an important tool both for optimizing and deploying existing pre-trained models and for simplifying model compression research. Mobile developers enjoy the comfort of reducing complex conversion processes to simple commands using Optimum. For example, even a mobile developer with limited Python knowledge can quantize and convert a model with the Optimum CLI and then integrate it into their mobile application. The Hugging Face documentation encourages this, saying, "if you want to run your models with maximum efficiency, check out the Optimum library."
Conclusion and Example Scenarios
Running large Transformer models on mobile and edge devices has become possible when the right techniques are applied. With methods like quantization, distillation, and pruning, which we've discussed in this article, fitting and accelerating massive models for portable environments is now common practice. Especially the innovations in the Hugging Face ecosystem have made this process more accessible than ever. For example, a mobile application developer can easily add a fast and offline text classification feature by downloading a DistilBERT model from Hugging Face, converting it to ONNX + INT8 format using Optimum, and running it with ORT in their application.
Real-world examples show this trend is gaining momentum: Apple is promoting on-device ML in its own applications and developer tools with Core ML; Google is working on language models integrated into Android phones (on-device NLP on Pixel devices); and companies like Meta and NVIDIA are encouraging edge deployment by releasing small versions of large models (e.g., Llama 3.2 1B/3B, Minitron). The Hugging Face Hub now has tags and collections for mobile-compatible models, and quantized versions of many models are being shared.
In conclusion, the message for mobile developers and artificial intelligence engineers is this: By using the right tools, you can offer the benefits of large Transformer models to your users in the mobile environment as well. With the techniques mentioned in this article, your application can provide smart answers without sending user data to the cloud and can process text, images, or audio in real-time. Moreover, you can do this while providing a battery-friendly and fast experience by using the device's hardware accelerators to the fullest. Thanks to developing open-source tools (Optimum, ONNX Runtime, Core ML Tools, etc.) and continuously shrinking "large" models, fitting AI into a pocket is no longer a dream, but a reality of today.