Artificial Intelligence (AI) plays a major role in accelerating the transformations needed to reach net-zero emissions globally within 30 years, such as integrating renewable energy and reducing the cost of carbon capture. However, the use of AI must itself be environmentally sustainable. The computing resources behind AI generate two forms of CO2 emissions: operational emissions from the electricity that powers the hardware, and embodied emissions from manufacturing it. To reduce both forms of emissions, research focuses on improving the efficiency of AI models on computing hardware while maintaining accuracy. Major cloud providers are transitioning to 100% carbon-free energy by 2030, and tech giants are also developing tools to help AI developers make their models more sustainable, extending AI efficiency gains beyond the cloud to constrained edge hardware.
Developing machine learning models typically involves several key steps: selecting model architectures and algorithms, tuning hyperparameters, training on existing datasets, and making predictions on new data. Here we take a closer look at how the energy efficiency of AI can be improved at each of these steps.
Efficient Model Architectures and Hyperparameter Tuning
The AI research community uses automated optimization techniques, namely Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO), to identify resource-efficient machine learning models. These techniques search for promising candidates within combinatorial search spaces too large to explore exhaustively. While traditionally used to optimize for prediction accuracy, they can also optimize for computational efficiency or cost. It is important that these techniques themselves operate efficiently. Meta-learning techniques from this research, such as Probabilistic Neural Architecture Search (PARSEC), have been incorporated into various machine learning platforms for efficient model selection and hyperparameter optimization. PARSEC uses a memory-efficient sampling procedure that greatly reduces memory requirements compared to previous methods, while Weightless PARSEC achieves comparable results at 100x lower computational cost.
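To illustrate the idea of optimizing for cost as well as accuracy, here is a minimal sketch of a cost-aware random hyperparameter search. The search space, the toy "accuracy" model, and the cost proxy are all illustrative assumptions, not part of PARSEC or any real HPO system; real frameworks evaluate candidates with actual validation runs.

```python
import random

# Hypothetical search space -- names and values are illustrative assumptions.
SEARCH_SPACE = {
    "hidden_units": [64, 128, 256, 512],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "num_layers": [2, 4, 8],
}

def toy_accuracy(cfg):
    # Stand-in for a real validation run: larger models score a bit higher.
    return (0.70
            + 0.04 * SEARCH_SPACE["hidden_units"].index(cfg["hidden_units"])
            + 0.02 * SEARCH_SPACE["num_layers"].index(cfg["num_layers"]))

def compute_cost(cfg):
    # Rough compute proxy: parameters processed per step.
    return cfg["hidden_units"] * cfg["num_layers"]

def cost_aware_search(trials=50, cost_weight=1e-4, seed=0):
    """Random search over a joint objective: reward accuracy, penalize cost."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        score = toy_accuracy(cfg) - cost_weight * compute_cost(cfg)
        if score > best_score:
            best, best_score = cfg, score
    return best

print(cost_aware_search())
```

With the cost penalty in the objective, the search no longer simply picks the largest model; tuning `cost_weight` trades accuracy against compute.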
The growth of very large-scale deep learning models motivated the development of Tensor Programs, enabling µTransfer, a procedure that can transfer training hyperparameters across model sizes. This allows the optimal hyperparameters discovered for a small model to be applied directly to a target scaled-up version. Compared to directly tuning the hyperparameters of large models, µTransfer enables equivalent accuracy levels while using at least an order of magnitude less compute, with no limit to the efficiency gain as the target model size grows. µTransfer makes hyperparameter tuning possible for very large-scale models for which it was previously prohibitively costly. It is available as an open-source package for models of any size. These advances in optimization techniques and procedures enable more efficient and sustainable machine learning models.
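The core idea of µTransfer can be sketched as follows: hyperparameters are tuned once at a small "base" width and then re-scaled, rather than re-tuned, for the large target model. The 1/width learning-rate rule below is one of the µP scaling rules for hidden-layer weights; treat the exact rule and the numbers as assumptions and refer to the open-source µP package for the full recipe.

```python
# Hedged sketch of width-based learning-rate transfer in the spirit of
# µTransfer: scale a hidden-layer learning rate found at a small base
# width down in proportion to the target width.

def mup_hidden_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """Transfer a hidden-layer learning rate from base width to target width."""
    return base_lr * base_width / target_width

# Tune cheaply at width 256, then transfer to width 8192 without re-tuning.
small_model_lr = 1e-2
large_model_lr = mup_hidden_lr(small_model_lr, base_width=256, target_width=8192)
print(large_model_lr)  # 256/8192 = 1/32 of the base learning rate
```

Because the expensive sweep happens only at the small width, the cost of tuning stays roughly constant as the target model grows.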
Efficient Model Training
Improving the efficiency of both the pre-training and fine-tuning stages of large language models is also vital to reducing AI's carbon footprint. Pre-training draws on large datasets to produce a general model, while fine-tuning tailors that model to specific tasks. LoRA (Low-Rank Adaptation of Large Language Models) uses trainable rank-decomposition matrices to greatly reduce the number of trainable parameters for downstream tasks during fine-tuning, cutting the GPU memory requirement by a factor of 3 compared to fine-tuning GPT-3 175B with Adam. EarlyBERT provides a general, computationally efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models; by pruning the model into a sparser version of the original, it achieves performance comparable to standard BERT with 35-45% less compute time needed for training.
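The parameter savings behind LoRA can be seen with simple arithmetic: instead of updating a full d x k weight matrix W, LoRA learns a low-rank update B @ A with rank r much smaller than d and k, so only d*r + r*k parameters are trainable while W stays frozen. The dimensions and rank below are illustrative assumptions, not the actual shapes used for GPT-3.

```python
# Minimal sketch of LoRA's trainable-parameter count: W (d x k) is frozen,
# and only the low-rank factors B (d x r) and A (r x k) are trained.

def lora_trainable_params(d: int, k: int, r: int) -> int:
    return d * r + r * k

d = k = 12288   # hidden size of a large transformer layer (illustrative)
r = 8           # LoRA rank (illustrative)

full = d * k
lora = lora_trainable_params(d, k, r)
print(f"full fine-tuning: {full:,} trainable params per matrix")
print(f"LoRA (r={r}):     {lora:,} trainable params per matrix")
print(f"reduction:        {full / lora:.0f}x")
```

At inference time the learned update B @ A can be merged back into W, so the low-rank structure adds no latency.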
LoRA has been successfully applied in practice to make real-world model development more resource-efficient, such as reducing the storage and hardware requirements for Power Apps Ideas which uses AI to help people create apps using plain language. A package for integrating LoRA with PyTorch models is available on GitHub.
Efficient Model Inference
Various techniques are being pursued to optimize the efficiency of machine learning models during inference, the process of making predictions on new data. The goal is to maximize predictive accuracy while minimizing computational cost, response times, and embodied carbon. These techniques include algorithmic improvements, model compression, knowledge distillation, quantization, and factorization.
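As a concrete example of one of these techniques, here is a toy sketch of post-training int8 quantization: float weights are mapped to 8-bit integers with a single per-tensor scale, shrinking weight memory roughly 4x versus float32. Production toolchains use considerably more sophisticated schemes (per-channel scales, calibration, quantization-aware training); this sketch only conveys the idea.

```python
# Symmetric per-tensor int8 quantization: store small integers plus one
# float scale, and reconstruct approximate weights on demand.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.5, -1.0, 0.25, 0.9]       # illustrative values
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
print(q)       # 8-bit codes
print(approx)  # approximate reconstruction of the original weights
```

Each reconstructed weight differs from the original by at most about half the scale, which is why quantization can preserve accuracy while cutting memory and latency.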
Factorizable neural operators (FNO) use low-rank matrix decompositions to achieve hardware-friendly compression of the costliest neural network layers, reducing memory usage by 80% and prediction time by 50% while maintaining accuracy. Knowledge distillation is another method that compresses large-scale, pre-trained models commonly used for language tasks by distilling knowledge from a larger teacher model into a smaller student model. Several developments based on this approach have achieved significant reductions in parameter counts, memory requirements, and inference latency.
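The distillation objective mentioned above can be sketched in a few lines: the student is trained to match the teacher's temperature-softened output distribution via cross-entropy. The logits and temperature below are illustrative assumptions; real distillation recipes typically also mix in a loss on the ground-truth labels.

```python
import math

# Toy knowledge-distillation loss: cross-entropy between the teacher's and
# student's temperature-softened output distributions.

def softmax(logits, T=1.0):
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student's softened outputs against the teacher's."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher = [4.0, 1.0, -2.0]   # large pre-trained model's logits (illustrative)
student = [3.5, 1.2, -1.5]   # smaller student model's logits (illustrative)
print(distillation_loss(student, teacher))
```

The temperature T > 1 softens the teacher's distribution so the student also learns from the relative probabilities of the wrong classes, which is where much of the teacher's "knowledge" lives.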
Combining FNO with other model compression methods, such as distillation and quantization, may yield further improvements; examples include XtremeDistil, MiniLM, and AutoDistil. These optimization techniques are critical for reducing the environmental impact of AI at scale, as large models may handle trillions of inferences per day.
This article was drafted with the assistance of AI and with reference to the sources below:
The work described in this article was supported by the InnoHK initiative, the Government of the HKSAR, and the Laboratory for AI-Powered Financial Technologies.
AIFT strives to ensure, but cannot guarantee, the accuracy and reliability of the content, and will not be responsible for any loss or damage caused by any inaccuracy or omission.