According to Statista, the rise of digital engagement has significantly reshaped customer expectations in 2024: clients expect contact centers to resolve their issues 24/7. Large language models (LLMs) stand behind the effective performance of AI chatbots, producing responses that are fluent and relevant. Fine-tuning turns a general LLM into an expert in a specific business domain. As a result, a fine-tuned LLM can generate output that reflects the data it was tuned on, including current trends and news.
In this article, Belitsoft’s Chief Innovation Officer (CINO), Dmitry Baraishuk, shares his expertise in fine-tuning and its best practices. Software development companies like Belitsoft offer full-cycle LLM training services to enhance generative AI apps, ranging from custom AI chatbots to complex domain-focused virtual assistants. ML experts fine-tune LLMs on proprietary datasets and employ prompt design strategies to achieve the best outputs.
What is an LLM?
Large language models are neural networks trained on large volumes of data, which may include text from the Internet and internal corporate information in various formats. The model “examines” how patterns are organized in human-written text and learns to predict the next word (token) based on the training data. The natural language processing (NLP) capabilities of LLMs let them maintain a human-like conversation: models pick up the vocabulary and jargon of their users, which makes the interaction natural and increases customer satisfaction.
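To make the next-token idea concrete, here is a minimal sketch in Python using the small public GPT-2 checkpoint from the Hugging Face transformers library (discussed later in this article); the prompt is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Ask the model for the single most likely next token after the prompt.
inputs = tokenizer("Customers expect contact centers to", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))
```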
LLMs can be divided into three components: architecture, input data, and tokenizers.
Architecture
The architecture behind LLMs is the transformer, which combines encoder and decoder components, and different models use those components differently. For example, GPT models are decoder-only: they generate text sequentially, predicting one token at a time. BERT is encoder-only and is trained with masking, predicting hidden tokens from context on both sides.
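A minimal sketch of this contrast, using two small public checkpoints via the transformers pipeline API (the model names and prompts are illustrative):

```python
from transformers import pipeline

# Decoder-only (GPT-style): continues text left to right, token by token.
generator = pipeline("text-generation", model="gpt2")
print(generator("Fine-tuning makes a general model", max_new_tokens=10))

# Encoder-only (BERT-style): fills in a masked token using context
# from both directions.
filler = pipeline("fill-mask", model="bert-base-uncased")
print(filler("Fine-tuning makes a general model an [MASK]."))
```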
LLM Input Data
The performance of an LLM depends on the data it is trained on. Some projects demand pre-training as the first stage of the training process. At this stage, the model learns from a large amount of unstructured data. Data cleaning steps include filtering out samples that are too short, removing repetitions within paragraphs, and deduplicating entries. After cleaning, the dataset usually shrinks in volume. The cleaned data is saved and prepared for tokenization.
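A minimal cleaning sketch in plain Python, under the assumptions above (the 200-character threshold and double-newline paragraph split are illustrative choices):

```python
def clean_corpus(samples, min_chars=200):
    """Filter short samples, collapse repeated paragraphs, deduplicate."""
    seen = set()
    cleaned = []
    for text in samples:
        if len(text) < min_chars:        # drop samples that are too short
            continue
        kept = []
        for paragraph in text.split("\n\n"):
            if not kept or paragraph != kept[-1]:  # drop consecutive repeats
                kept.append(paragraph)
        text = "\n\n".join(kept)
        if text in seen:                 # deduplicate exact entries
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```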
Tokenizers
Tokenization transforms text into numbers using a specific tokenizer and packs the tokens into continuous sequences of the model’s maximum length to make training efficient. Tokens are units of the model’s vocabulary, e.g., words, subwords, or syllables, that the model uses to represent text.
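A sketch of both steps with a GPT-2 tokenizer from transformers (the block size of 1,024 matches GPT-2’s context window; other values are possible):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Text becomes a list of integer token IDs.
ids = tokenizer("Fine-tuning adapts a general model.")["input_ids"]
print(ids)

def pack(documents, block_size=1024):
    """Concatenate tokenized documents, then cut fixed-length training blocks."""
    stream = []
    for doc in documents:
        stream.extend(tokenizer(doc)["input_ids"])
        stream.append(tokenizer.eos_token_id)  # mark document boundaries
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]
```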
Why Do LLMs Hallucinate?
As we have already mentioned, LLMs generate the output that is statistically the most likely continuation of the input. If the prompt is the beginning of some document, the model will provide a plausible continuation of that document. Sometimes, however, models become misleading: they give a meaningful-sounding answer that is factually incorrect. Such behavior of an LLM is called hallucination. To cope with hallucinations, users can provide wider context in their prompts and avoid biases in the input. On the training side, large and clean datasets and sufficient computing power help produce high-quality output.
What Is Fine-Tuning?
Fine-tuning is the process of improving an LLM with additional data after it has already been trained. Machine learning (ML) engineers use it to teach the model how to behave in new circumstances and unfamiliar domains.
Stages of Fine-Tuning an LLM
Fine-tuning is a complex process that includes several stages. By following them, engineers obtain a refined model that generates contextually appropriate responses.
- Data preparation: cleaning and formatting the data for the target task, such as instruction following, sentiment analysis, or topic classification.
- Model initialization: setting the initial parameters of the LLM. ML engineers do this to make sure the model performs correctly, is able to train, and doesn’t run into issues like vanishing or exploding gradients.
- Training setup: preparing the environment for LLM training. Experts choose relevant data, determine the architecture and hyperparameters, and configure how the model’s weights and biases will be adjusted to adapt the model to certain tasks.
- Fine-tuning: adjusting the model fully or partially. Data scientists apply one of two approaches. Full fine-tuning updates all the weights, which risks “catastrophic forgetting”: the model overwrites what it learned during previous training. The other approach freezes some of the neural layers and updates only the selected parts; it is known as parameter-efficient fine-tuning (PEFT) and addresses issues such as heavy computational requirements and overfitting (see the sketch after this list).
- Validation and assessment: applying dedicated metrics to evaluate the results. ML experts use cross-entropy loss to measure prediction errors and observe loss curves to spot cases of over- or underfitting.
- Deployment: implementing the model in applications. IT specialists check that the model runs smoothly on the intended hardware or software platforms. This stage also involves customizing integrations and setting up security measures.
- Monitoring: keeping track of the model’s performance. This is a process of continuous observation of the performance, fixing the issues that arise, and updating the model when new data or requirements appear.
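As a sketch of the partial-update approach from the fine-tuning stage above, here is how layers can be frozen with a BERT-style model in transformers (the checkpoint, label count, and choice of which layers to unfreeze are illustrative):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Freeze the whole pre-trained encoder...
for param in model.bert.parameters():
    param.requires_grad = False

# ...then unfreeze only the last two encoder layers; the new
# classification head stays trainable by default.
for param in model.bert.encoder.layer[-2:].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```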
Fine-Tuning with LoRA
Demand for AI capabilities is projected to keep growing, and LLMs will grow correspondingly. Larger LLMs will need more resources, limiting the capacity of smaller companies and startups, so large corporations will likely run most of the biggest models. To reduce the costs of custom fine-tuning, companies will turn to the PEFT approach. It allows for updating only a few model parameters, reducing memory usage and enabling fine-tuning on consumer-grade hardware. One of the techniques illustrating this approach is Low-Rank Adaptation (LoRA).
LoRA offers developers several benefits (a configuration sketch follows the list):
- It reduces hardware resource demands: instead of computing and storing updates for the full weight matrices, it adds and optimizes much smaller, low-rank matrices that represent changes to the original model weights.
- It adds no inference slowdown: because the low-rank update is linear, the adjustable matrices can be merged with the fixed pre-trained weights.
- Developers can create several small adapters on the basis of one pre-trained model. LoRA allows them to save and load the different matrices and switch between tasks.
- Using LoRA requires no change to the model’s architecture; it integrates with current pre-trained models.
- If required, experts can further improve the performance of the model with adapter layers or prompting.
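A minimal LoRA configuration sketch using the Hugging Face peft library; the base model, rank, scaling factor, and target module names are illustrative, not prescriptive:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # a small fraction of the base model
```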
Useful Tools for Workflow Optimization
LoRA is one of the techniques implemented in PEFT, an open-source library from Hugging Face (HF). HF provides developers with a platform and a community to take their AI ideas from plan to deployment quickly and cheaply. It offers ML engineers a variety of pre-trained models for different AI tasks, such as NLP, computer vision, and audio processing. The resulting AI apps may include chatbots, AI translation, image and text classification, speech recognition assistants, etc.
Software developers don’t need to build a model and train it from scratch. Instead, they choose an available pre-trained model, embed it into their projects with minimal code, and use the HF training library. This helps to train models faster and to reuse existing models many times, which saves resources.
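For instance, a working sentiment classifier takes only a few lines (the checkpoint name is one of many available on the Hub):

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("The support team resolved my issue in minutes."))
```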
HF is not only a collection of models; it also contains tools for convenient ML development. For example, Transformers is a library of model classes, tokenizers, and APIs for quickly integrating models into a project’s existing code. Developers can use HF’s Diffusers library for image generation and Accelerate for mixed-precision and distributed training. HF hosts more than 200,000 datasets with images, text, audio, etc., which ML teams use to train their models. Besides, they benefit from evaluation libraries to validate the model’s performance at different stages.
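A short sketch of loading a public dataset and a metric (the IMDB dataset and accuracy metric are illustrative picks; the metric API lives in the companion evaluate library):

```python
from datasets import load_dataset
import evaluate

dataset = load_dataset("imdb", split="train[:1000]")  # a public text dataset
accuracy = evaluate.load("accuracy")                  # an evaluation metric

print(dataset[0]["text"][:100])
print(accuracy.compute(predictions=[0, 1, 1], references=[0, 1, 0]))
```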
HF also offers paid services. For example, HF manages Inference Endpoints: it deploys the model on its servers and supplies a scalable API endpoint.
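Calling such an endpoint is a plain HTTPS request; in this sketch the URL and token are placeholders you would take from your HF account:

```python
import requests

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # placeholder
headers = {"Authorization": "Bearer hf_YOUR_TOKEN"}                 # placeholder

response = requests.post(
    ENDPOINT_URL, headers=headers,
    json={"inputs": "Summarize this support ticket: ..."})
print(response.json())
```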
HF Hub is a cloud-hosted repository system for AI models, datasets, and demos. It’s a public marketplace where developers manage versions, discuss issues, and contribute to the creation of ML artifacts, much like code collaboration on GitHub. This comes in handy in the later stages of development, as teams store and manage models, share them internally or publicly, and collaborate with the open-source community.
Such a rich collection of tools is a treasure trove for startups: they can study the documentation and examples and implement sophisticated ML functionality with far less code.
Conclusion
Before tackling fine-tuning of an LLM, ask yourself how much training data your LLM requires, how you are going to clean it, how many hours or days you are ready to spend fine-tuning the LLM yourself, and what the pros and cons of engaging a software firm would be.