Student Work

Fine-Tuning Open-Source Large Language Models for Generating Math Explanations

Public Deposited

Downloadable Content


The widely circulated memo “We Have No Moat” argues that open-source large language models (LLMs) with 7 billion parameters can rival models from large tech companies with 500 billion parameters. Open-source LLMs have also become more accessible and easier to fine-tune with the rise of open-source resources such as Hugging Face. Using prompt engineering and fine-tuning, the goal of this project was to find and evaluate open-source LLMs that could potentially match the performance of OpenAI’s GPT-3.5.

We aim to help ASSISTments, a non-profit organization focused on middle-school math education, develop open-source LLMs to transition from tedious and somewhat inaccurate hand-written explanations to streamlined, automatically generated ones. Open-source LLMs offer a more cost-effective option than GPT-3.5 and a more time-efficient option than writing explanations by hand. ASSISTments has already begun integrating LLMs into its website, and our focus was on improving the explanation-generating LLMs.

Leveraging a framework of prompt engineering and fine-tuning, we tested and evaluated the effectiveness of several models at writing accurate math explanations. During prompt engineering, we double-blinded the responses for each prompt before scoring them, which allowed us to assign scores in an unbiased manner. Through an iterative process, we saw up to 80% improvement with our best prompts compared to prompting the LLM with only a labeled question-answer pair. With fine-tuning, we were unable to significantly improve WizardMath’s mathematical reasoning, but fine-tuning was highly effective at producing consistently formatted answers, which made the explanations more readable than those of the base WizardMath. This framework was ultimately used to compare the performance of three LLMs in generating explanations for ASSISTments questions.
We found that the fine-tuned model improved on the base model by about 5%, while GPT-3.5 outperformed the base model by roughly 45%. Our results show promise for using LLMs to generate accurate and readable explanations. Furthermore, our fine-tuning and prompt engineering framework can be applied in other fields where LLMs are integrated, in order to optimize their performance.
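The double-blind scoring step described above can be sketched in a few lines: responses from the competing models are shuffled under a fixed seed so raters grade them without knowing which model produced which, and scores are mapped back to model names only afterwards. This is a minimal illustrative sketch, not code from the report; the model names and helper functions are assumptions.

```python
import random


def blind_responses(responses, seed=0):
    """Shuffle (model, response) pairs so raters see responses without
    model labels. Returns the blinded response list and the key needed
    to unblind the scores later. Seeding keeps the shuffle reproducible."""
    rng = random.Random(seed)
    items = list(responses.items())
    rng.shuffle(items)
    key = {i: model for i, (model, _) in enumerate(items)}
    blinded = [text for _, text in items]
    return blinded, key


def unblind_scores(scores, key):
    """Map per-position rater scores back to the model names."""
    return {key[i]: score for i, score in enumerate(scores)}


# Hypothetical example with three models, mirroring the comparison above.
responses = {
    "wizardmath-base": "Explanation A ...",
    "wizardmath-finetuned": "Explanation B ...",
    "gpt-3.5": "Explanation C ...",
}
blinded, key = blind_responses(responses, seed=42)
# Raters score the blinded list in order; only then do we unblind.
scores = unblind_scores([3, 5, 4], key)
print(sorted(scores))  # model names recovered; labels were hidden during grading
```

The seed lets the same blinding be reproduced across raters while still decoupling a response's position from its source model.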

  • This report represents the work of one or more WPI undergraduate students submitted to the faculty as evidence of completion of a degree requirement. WPI routinely publishes these reports on its website without editorial or peer review.
Creator
Publisher
Identifier
  • E-project-022824-174239
  • 117968
Keyword
Advisor
Year
  • 2024
Date created
  • 2024-02-28
Resource type
Major
Source
  • E-project-022824-174239
Rights statement

Relations

In Collection:

Permanent link to this page: https://digital.wpi.edu/show/cr56n536x