Khmer Question-Answering Model by Fine-tuning Pre-trained Model
1. Graduate Student, Master of Engineering in Computer Science, Graduate School, Institute of Technology of Cambodia
Received: July 31, 2024 / Revised: August 23, 2024 / Accepted: September 04, 2024 / Available online: April 30, 2025
With the growth of artificial intelligence, large language models, which are trained on vast quantities of textual data, can be tailored to particular tasks such as chatbots, text generation, and question answering. However, most existing pre-trained models were trained on English datasets, leading to limited support and low performance in other languages, especially low-resource languages like Khmer. To address this imbalance, we aim to build a Khmer language model by fine-tuning pre-trained state-of-the-art (SOTA) models, with a focus on question-answering tasks. We propose a supervised fine-tuning process on a labeled dataset, using quantization together with low-rank adaptation (LoRA) for memory-efficient optimization; we also employ flash attention to speed up training. Before the experiment, we observed that some SOTA models could not recognize the Khmer language, so we perform vocabulary expansion to deal with this problem. For the experiment, we collect datasets from online sources containing question-and-answer pairs in the general-knowledge domain. Three decoding strategies, namely greedy search, beam search, and contrastive search, are used to select output tokens when generating text. We use bilingual evaluation understudy (BLEU) as the evaluation metric because it measures the similarity between generated responses and reference sentences. Through the experiment, the BLEU score of the fine-tuned Gemma 7B model increased from 0.0539 to 0.2863 with greedy search, from 0.0227 to 0.2765 with beam search, and from 0.0009 to 0.2201 with contrastive search. This increase shows that fine-tuning enhances the model's performance; the scores also indicate that the model can generate clear responses, though with grammatical errors. The findings of this study contribute to the growing body of research on applying deep learning techniques to the Khmer language for question answering. In conclusion, such models offer benefits across various domains: their ability to understand natural language makes them valuable tools for businesses, educators, and researchers.
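As an illustration of the fine-tuning setup described in the abstract (4-bit quantization, low-rank adaptation, flash attention, and vocabulary expansion), the following is a minimal sketch assuming the Hugging Face transformers, peft, and bitsandbytes libraries. The checkpoint name, the added token list, and the LoRA hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization so the 7B model fits in modest GPU memory (illustrative settings).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "google/gemma-7b"  # assumed checkpoint for the Gemma 7B base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # flash attention to speed up training
)

# Vocabulary expansion: register Khmer tokens the base tokenizer misses, then
# resize the embedding matrix to the new vocabulary size.
khmer_tokens = ["\u1780", "\u1781", "\u1782"]  # illustrative; a real list comes from a Khmer corpus
tokenizer.add_tokens(khmer_tokens)
model.resize_token_embeddings(len(tokenizer))

# Low-rank adaptation: train small adapter matrices instead of the full weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```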
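The three decoding strategies compared in the study map directly onto options of the transformers generate API. The sketch below is illustrative: the prompt, generation length, beam width, and contrastive-search settings are assumptions rather than the paper's exact values.

```python
# Compare greedy, beam, and contrastive decoding on one question (illustrative prompt).
prompt = "What is the capital of Cambodia?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy search: always take the single most probable next token.
greedy_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Beam search: keep the num_beams most probable partial sequences at each step.
beam_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4, do_sample=False)

# Contrastive search: weigh model confidence over top_k candidates against a
# degeneration penalty (penalty_alpha) to reduce repetitive output.
contrastive_ids = model.generate(**inputs, max_new_tokens=128, penalty_alpha=0.6, top_k=4)

for ids in (greedy_ids, beam_ids, contrastive_ids):
    print(tokenizer.decode(ids[0], skip_special_tokens=True))
```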
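BLEU scoring of generated answers against references can be reproduced with, for example, the sacrebleu package; the hypothesis/reference pair below is illustrative. Note that sacrebleu reports scores on a 0-100 scale, while the abstract quotes them on a 0-1 scale.

```python
import sacrebleu

# One generated answer and its reference (illustrative strings).
hypotheses = ["Phnom Penh is capital of Cambodia."]
references = [["Phnom Penh is the capital of Cambodia."]]  # one reference stream

# Corpus-level BLEU across all generated answers.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score / 100)  # divide by 100 to match the paper's 0-1 BLEU values
```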
