Latest Issue
- Effect of Different Irrigation Methods on Water Use Efficiency in Rice Soil Column Test (Published: April 30, 2025)
- Optimization of Extraction Condition for Oleoresin from Red Pepper Residues (Published: April 30, 2025)
- Bus Arrival Time Prediction Using Machine Learning Approaches (Published: April 30, 2025)
- A Deep Learning Approach for Identifying Individuals Based on Their Handwriting (Published: April 30, 2025)
- Khmer Question-Answering Model by Fine-tuning Pre-trained Model (Published: April 30, 2025)
- CNN-based Reinforcement Learning with Policy Gradient for Khmer Chess
Published: April 30,2025Khmer Question-Answering Model by Fine-tuning Pre-trained Model
-
1. Graduate student, Master of Engineering in Computer Science, Graduate School, Institute of Technology of Cambodia
Received: July 31, 2024 / Revised: August 23, 2024 / Accepted: September 04, 2024 / Available online: April 30, 2025
With the growth of artificial intelligence, large language models, which are trained on vast quantities of textual data, can be tailored to particular tasks such as chatbots, text generation, and question answering. However, most existing pre-trained models were trained on English datasets, leading to limited support and low performance in other languages, especially low-resource languages like Khmer. To address this imbalance, we aim to build a Khmer language model by fine-tuning pre-trained state-of-the-art (SOTA) models, with a focus on question-answering tasks. We propose a supervised fine-tuning process that trains on a labeled dataset and uses quantization together with low-rank adaptation for memory-efficient optimization. Moreover, we apply flash attention to speed up training. Before the experiment, we observed that some SOTA models could not recognize the Khmer language; to deal with this problem, we perform vocabulary expansion. For the experiment, we collect datasets from online sources containing question-and-answer pairs in the general knowledge domain. Three decoding strategies, namely greedy search, beam search, and contrastive search, are used to select output tokens when generating text. We use the bilingual evaluation understudy (BLEU) metric because it measures the similarity between generated responses and reference sentences. In the experiment, the BLEU score of the fine-tuned Gemma 7B model increased from 0.0539 to 0.2863 with greedy search, from 0.0227 to 0.2765 with beam search, and from 0.0009 to 0.2201 with contrastive search. These increases show that fine-tuning enhances the model's performance, and the scores indicate that the model can generate clear responses that still contain grammatical errors. The findings of this study contribute to the growing body of research on applying deep learning techniques to Khmer question answering. In conclusion, such models offer benefits across various domains: their ability to understand natural language makes them valuable tools for businesses, educators, and researchers.
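To make the pipeline described in the abstract concrete, the following is a minimal sketch, using the Hugging Face Transformers, PEFT, and NLTK libraries, of how 4-bit quantization, low-rank adaptation, flash attention, vocabulary expansion, the three decoding strategies, and BLEU scoring could fit together. The model name, placeholder tokens, prompts, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Sketch of the abstract's setup: 4-bit quantization plus low-rank adaptation (LoRA)
# for memory-efficient fine-tuning, flash attention, vocabulary expansion, three
# decoding strategies, and BLEU scoring. Names and values are assumed for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from nltk.translate.bleu_score import sentence_bleu

model_name = "google/gemma-7b"  # assumed checkpoint; the paper fine-tunes Gemma 7B

# Quantization: load the base model in 4-bit to reduce memory use.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # flash attention to speed up training
    device_map="auto",
)

# Vocabulary expansion: register Khmer tokens the base tokenizer does not cover,
# then resize the embedding matrix to match.
khmer_tokens = ["<khmer_token_1>", "<khmer_token_2>"]  # placeholders for real Khmer subwords
tokenizer.add_tokens(khmer_tokens)
model.resize_token_embeddings(len(tokenizer))

# Low-rank adaptation: only small adapter matrices are trained on the labeled
# question-answer pairs; the quantized base weights stay frozen.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# ... supervised fine-tuning on the question-answer dataset would run here ...

# Decoding: the three strategies map onto generate() arguments.
prompt = "<a Khmer question>"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
greedy = model.generate(**inputs, max_new_tokens=128)                     # greedy search
beam = model.generate(**inputs, max_new_tokens=128, num_beams=4)          # beam search
contrastive = model.generate(**inputs, max_new_tokens=128,
                             penalty_alpha=0.6, top_k=4)                  # contrastive search

# Evaluation: BLEU compares a generated answer with a reference answer.
hypothesis = tokenizer.decode(greedy[0], skip_special_tokens=True).split()
reference = "<the reference Khmer answer>".split()  # placeholder reference
bleu = sentence_bleu([reference], hypothesis)
```

In this kind of setup, greedy search takes the most probable token at each step, beam search keeps several candidate sequences, and contrastive search penalizes degenerate repetition, which is why the abstract reports a separate BLEU score for each strategy.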