Khmer Question-Answering Model by Fine-tuning Pre-trained Model
    1. Graduate student, Master of Engineering in Computer Science, Graduate school, Institute of Technology of Cambodia

Received: July 31, 2024 / Revised: August 23, 2024 / Accepted: September 04, 2024 / Available online: April 30, 2025


A large language model (LLM), a product of the growth of artificial intelligence, is a model trained on a vast quantity of textual data. Such a model can be tailored to a particular task, such as chatbots, text generation, or question answering. However, most existing pre-trained models were trained on English datasets, leading to limited support and low performance in other languages, especially low-resource languages like Khmer. To address this imbalance, we aim to build a Khmer language model by fine-tuning pre-trained state-of-the-art (SOTA) models, with a focus on question-answering tasks. We propose a supervised fine-tuning process that trains on a labeled dataset and uses quantization together with low-rank adaptation (LoRA) for memory-efficient optimization. Moreover, we apply flash attention to speed up training. Before starting the experiment, we observed that some SOTA models were not able to recognize the Khmer language; to deal with this problem, we perform vocabulary expansion. For the experiment, we collected datasets of question-and-answer pairs in the general knowledge domain from online sources. Three decoding strategies (greedy search, beam search, and contrastive search) are used to select the output tokens when generating text. We use bilingual evaluation understudy (BLEU) as the evaluation metric because it measures the similarity between generated responses and reference sentences. In the experiment, the BLEU score of the fine-tuned Gemma 7B model increased from 0.0539 to 0.2863 with greedy search, from 0.0227 to 0.2765 with beam search, and from 0.0009 to 0.2201 with contrastive search. This increase shows that the fine-tuning process enhances the performance of the model. The scores also indicate that the model can generate clear responses, although they still contain grammatical errors.
The findings of this study contribute to the growing research on applying deep learning techniques to Khmer question answering. In conclusion, Khmer question-answering models of this kind can offer benefits across various domains: their ability to understand natural language makes them valuable tools for businesses, educators, and researchers.
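
To make the reported metric concrete, the following is a minimal sketch of sentence-level BLEU: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It is an illustration under stated assumptions, not the evaluation code used in the study. It assumes the generated and reference texts are already split into tokens (Khmer is written without spaces between words, so a word segmenter would be needed first), it uses a single reference and no smoothing, and the names `ngrams` and `sentence_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one tokenized candidate against one tokenized reference.

    Assumption: both arguments are lists of tokens produced by the same
    (hypothetical) segmenter; the segmentation used in the study is not
    stated in the abstract.
    """
    if not candidate:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            # Without smoothing, one zero precision zeroes the geometric mean.
            return 0.0
        log_precision_sum += math.log(overlap / total)

    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

    return brevity_penalty * math.exp(log_precision_sum / max_n)
```

Calling `sentence_bleu(generated_tokens, reference_tokens)` returns a value between 0 and 1, the same scale as the scores reported above. Corpus-level BLEU, as implemented in standard toolkits, pools n-gram counts over all sentence pairs before taking the geometric mean, so it is not simply the average of these per-sentence values.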