With the recent emergence of Large Language Models (LLMs), many companies have successfully harnessed the power of these models by combining them with their enterprise data, leading to remarkable achievements across various fields. For instance, Google and Bing have used large models to enhance their search engine recommendations and improve the user search experience, while Bloomberg trained BloombergGPT 1 on financial data to provide generative AI assistance in the financial sector. Yet while large language models present new opportunities for enterprises, high deployment costs 2 and the potential risk 3 of cross-platform data leakage have made companies cautious. Consequently, many companies are actively seeking ways to deploy large language models locally, ensuring both privacy protection and cost-efficiency. This raises an important question: is there a technology that can meet this demand? The answer lies in knowledge distillation, a model compression method that reduces computational and storage costs while safeguarding data privacy, offering a straightforward path for enterprises looking to deploy LLMs locally. In this article, we provide a detailed introduction to knowledge distillation in the context of large language models.
Knowledge Distillation: How to Distil? What to Distil?
The computer science community has long been aware of the computation and storage challenges posed by the ever-increasing size of models, both in their number and in their parameter counts. In response, Geoffrey Hinton, a pioneer in artificial intelligence, introduced the concept of knowledge distillation in 2015 [1]. Distillation, as seen in Figure 1, is a technique used in chemistry to separate the components of a mixture based on their boiling points; the volume of liquid collected after distillation never exceeds the volume of the original mixture. This chemical process offers a useful metaphor. In knowledge distillation, the “containers” are models: the heated container corresponds to the large model before distillation, and the cooled container to the compressed model after it. The distilled liquid represents knowledge, with the model serving as its repository, as depicted in Figure 2, and different “temperatures” distil different “knowledge” during the process. We can also draw a parallel with a teacher imparting knowledge to students in a classroom. Accordingly, the model before distillation is typically called the teacher model, and the model after distillation the student model (Figure 2).
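To make the “temperature” metaphor concrete, the sketch below shows how temperature scaling softens a model’s output distribution. This is a minimal pure-Python illustration; the logit values are made up for the example:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw model scores (logits) into a probability distribution.
    A higher temperature flattens the distribution, exposing more of the
    teacher's knowledge about how the non-top classes relate."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes.
logits = [5.0, 2.0, 1.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)
# At T=1 the top class dominates; at T=4 the distribution is softer,
# so a student can see the relative ranking of the other classes.
```

This is why the metaphor speaks of different temperatures distilling different knowledge: the temperature controls how much of the teacher’s fine-grained class similarity is exposed to the student.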
How does knowledge distillation work in practice, and what knowledge is actually distilled? Take an LLM as an example. Distillation typically transfers knowledge from the teacher model, a large language model, to the student model, a smaller deep neural network. This is accomplished through the teacher model’s intermediate states: internal representations, such as output distributions or hidden-layer activations, that encode the teacher’s knowledge of natural language as vectors. During training, the student model is optimised to match these intermediate states, effectively learning and absorbing the distilled knowledge and thereby acquiring the valuable insights the teacher model has captured.
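The matching process described above can be sketched as a loss function. The following is a simplified, framework-free illustration of Hinton-style distillation [1]: the student is penalised both for diverging from the teacher’s temperature-softened distribution and for missing the ground-truth label. The function names, the temperature, and the weighting `alpha` are illustrative choices, not a prescribed recipe:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """A simplified sketch of a distillation objective:
    - soft term: cross-entropy between the teacher's and the student's
      temperature-softened distributions (the 'distilled knowledge');
    - hard term: ordinary cross-entropy against the ground-truth label.
    `alpha` balances the two terms."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_loss = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    hard_probs = softmax(student_logits, 1.0)
    hard_loss = -math.log(hard_probs[hard_label])
    # The T^2 rescaling keeps the soft-term gradients comparable in
    # magnitude across temperatures, as in the original formulation.
    return alpha * (temperature ** 2) * soft_loss + (1 - alpha) * hard_loss

# Example: a student that mimics the teacher gets a lower loss.
loss = distillation_loss(student_logits=[3.5, 1.2, 0.3],
                         teacher_logits=[4.0, 1.0, 0.0],
                         hard_label=0)
```

In a real system the student’s parameters would be updated by gradient descent on this loss over many training batches; the sketch only shows what is being measured at each step.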
In summary, knowledge distillation is a technique that aims to transfer knowledge from a teacher model to a student model by utilising the intermediate states of the teacher model during training. This process protects privacy as the original data is only accessible to the teacher model, and the intermediate states cannot be directly interpreted as actual knowledge. Additionally, the student model achieves similar performance to the teacher model and has a smaller model size, resulting in significant reductions in computation and storage costs (typically achieving more than 10 times compression).
Knowledge Distillation: Collaboration of Financial Enterprises under Data Privacy Protection
The Chinese Financial Association of Hong Kong 4 recently organised a special event on applying LLMs in the financial field, generating significant interest among financial enterprises. Although renowned AI companies such as OpenAI and Bloomberg offer LLM APIs for enterprise use, the data transmission these APIs require raises privacy concerns, so financial enterprises prefer local deployment of LLMs for better privacy protection. We have discussed the benefits of knowledge distillation for model compression and privacy protection; in the era of LLMs, how can financial enterprises effectively utilise this method for local deployment while ensuring privacy protection?
One approach is direct distillation: a financial enterprise distils its own LLM and distributes the resulting student models to other financial enterprises. This gives recipients immediate access to a working model but limits how far they can adapt it, so it suits financial enterprises with similar services or businesses.
Another approach, available to financial enterprises that own large language models, is to share their knowledge (the intermediate states) with other enterprises, which can then use it for local distillation or model training. The advantage is convenient iteration: as the transferred knowledge is updated, the distilled models can be retrained accordingly. The drawback is that recipients need sufficient model training experience to train the distilled model effectively. This method suits adaptable, personalised collaboration between financial enterprises.

Beyond these two approaches, financial enterprises with LLMs can offer one-to-many distillation services tailored to specific business needs, and enterprises that have already obtained distilled models can further explore and fine-tune them to align with their unique business requirements. In practical applications, knowledge distillation often achieves more than tenfold compression in model size while maintaining similar performance levels.
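As a rough sketch of the knowledge-sharing approach, the example below packages a teacher’s per-prompt outputs for transfer to a partner enterprise; only these vectors, never the raw enterprise data or the model weights, would cross the organisational boundary. All names here, and the toy stand-in for the teacher model, are hypothetical:

```python
import json

def export_teacher_knowledge(prompts, teacher_fn):
    """Package the teacher's intermediate outputs (here, per-prompt
    logits) as JSON for transfer to a partner enterprise."""
    return json.dumps([
        {"prompt_id": i, "logits": teacher_fn(p)}
        for i, p in enumerate(prompts)
    ])

# Stand-in for a real teacher LLM's forward pass (purely illustrative).
def toy_teacher(prompt):
    return [float(len(prompt)), 1.0, 0.5]

package = export_teacher_knowledge(
    ["what is margin?", "define hedge"], toy_teacher)
records = json.loads(package)
# The receiving enterprise trains its student locally on `records`:
# only vectors, not the original documents, leave the organisation.
```

Because the shared intermediate states cannot be directly read back as the original text, this exchange preserves the privacy properties discussed earlier while still letting the partner iterate on its own student model.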
Challenges and Opportunities in the Future
Although we have discussed the advantages of knowledge distillation and its potential applications in the era of LLMs, it is essential to acknowledge its limitations. First, distillation may reduce model performance, although existing research has shown that this loss is relatively small. Second, the distillation process requires a certain amount of time, computational resources, and experience in training deep models (though these requirements are significantly lower than those of training an LLM from scratch). Users of knowledge distillation should therefore weigh these trade-offs carefully before adopting it.
However, as LLMs continue to develop, both their performance and their size can be expected to keep growing. By using knowledge distillation to balance performance against cost while protecting user data privacy, even small and medium-sized enterprises with limited funds and resources can deploy locally distilled LLMs. We believe that further advances in knowledge distillation will improve the deep learning models used in enterprises, ultimately enhancing service quality and user experience.
[1] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).
[2] Gu, Yuxian, et al. “Knowledge Distillation of Large Language Models.” arXiv preprint arXiv:2306.08543 (2023).
[3] Gou, Jianping, et al. “Knowledge distillation: A survey.” International Journal of Computer Vision 129 (2021): 1789-1819.
The work described in this article was supported by InnoHK initiative, The Government of the HKSAR, and Laboratory for AI-Powered Financial Technologies.
(AIFT strives but cannot guarantee the accuracy and reliability of the content, and will not be responsible for any loss or damage caused by any inaccuracy or omission.)