Leveraging LLMs: ChatGPT May Not Be Your Best Friend
Upon its release, ChatGPT left an indelible mark. The chatbot swiftly came to dominate both personal and professional discourse, captivating minds, raising concerns, and prompting transformative conversations.
From inquiries about the theory of relativity to practical advice on car maintenance and the intricacies of managing workplace stress, ChatGPT answers a wide array of questions almost instantly. Its capacity to deliver coherent and insightful responses to such diverse queries captured the public’s imagination, setting a new standard for conversational AI.
From the outset, the model aspired to evolve into a formidable repository of knowledge and a powerhouse for AI content generation. Yet this monumental achievement is not without its drawbacks: ChatGPT’s most remarkable feature may also be the very factor that keeps it from becoming your company’s preferred Large Language Model (LLM).
LLMs: From Impact to Limitations
LLMs are deep learning algorithms capable of handling various natural language processing (NLP) tasks. While the concepts of NLP and LLMs are not new, ChatGPT distinguishes itself with two remarkable features that have propelled it to the forefront of AI technology, both embedded within its very name.
The “GPT” in ChatGPT stands for Generative Pre-trained Transformer, and the tool served as the gateway that ushered LLMs into mainstream use. It goes beyond conventional NLP capabilities by interpreting user queries and crafting coherent, contextually relevant content in response to prompts. With ChatGPT, you don’t just receive existing content; you witness the generation of new content at your command.
Pre-training exposed the model to a vast dataset, reportedly 45 terabytes of text encompassing roughly 500 billion tokens (words or word fragments). This extensive pre-training is the source of ChatGPT’s remarkable capabilities, as it gives the model a wealth of general knowledge to draw on.
OpenAI, the creator of ChatGPT, developed a model that couples a deep well of knowledge with generative abilities, allowing it to answer a substantial number of questions in a conversational format. There remains an ongoing debate about what actually sets ChatGPT apart from other LLMs. Is it the interface’s ease of use, rather than the data itself, that accounts for its rapid adoption?
This ease of use was intentional from the earliest phases of the model’s design. The chat interface is highly effective at comprehending the intentions behind your queries. As a cofounder of OpenAI said, “So you don’t have to be the one who spells out every single sort of little piece of what’s supposed to happen.”
What is often not discussed is the cost of this new tool, both financial and computational. Processing that much data is expensive. On top of the reported $4 million it took to train the model, analysts estimate that daily operating costs are upwards of $700,000, most of which goes to the expensive servers the model requires. The next iteration, GPT-4, shows no indication of being cheaper.
It’s also important to acknowledge ChatGPT’s limitations, particularly in terms of the prompts it can handle and the data sources it can draw from when generating content. These limitations have prompted conversations about intellectual property and copyright. In addition, users familiar with ChatGPT are aware that while its responses are impressive, they may still require verification and editing.
When delving into LLMs, it’s crucial to recognize that no single model is inherently superior. Instead, each model serves a distinct purpose and the choice of implementation hinges on the specific objectives. In the context of technical fields like engineering requirements, where precision is paramount, the question of whether to opt for the convenience of ChatGPT or to seek a bespoke solution becomes pertinent.
Opportunities in the Requirement World
When ChatGPT garnered attention within the requirement-authoring industry, it generated a sense of anticipation. The model served as an auspicious initial point of exploration. Users had the option to delegate the onerous task of requirement authoring to this AI entity. Nevertheless, concerns emerged regarding the quality of the requirements thus generated.
At QRA we routinely engage with requirements in a substantial capacity. We have, in the past, automated the facets of quality and consistency in requirement authoring through our proprietary software, QVscribe. However, the potential for ChatGPT to automate the authoring aspect remained an open question that necessitated validation.
To evaluate ChatGPT’s ability in requirement generation, we initiated a prompt: “Could you compose fifty functional engineering requirements for an automobile, such as a Honda Civic?” The AI promptly generated the requested set of requirements. At a superficial glance, these requirements appeared reasonable, yet a comprehensive analysis conducted using QVscribe revealed a contrasting narrative.
After subjecting all fifty requirements to scrutiny and assessment, the software yielded an overall quality score of 2 out of 5. Such a rating is deemed unacceptable within the requirement industry due to the high risk associated with the requirements.
It is important to note that QVscribe uses INCOSE and industry standards as the benchmark for evaluating requirement quality. Its scoring system speaks to the quality and correctness of a requirement. A 5/5 score signifies a quality requirement with a high likelihood of correctness, while a 0/5 score indicates low quality and a high likelihood of errors. We recommend a baseline score of 4/5 for all requirements within a document, excluding any specific, necessary deviations from industry standards.
Remarkably, only three of the fifty requirements generated by ChatGPT received a 5/5 score. Numerous issues were identified, such as missing imperatives, superfluous infinitives, and the presence of vague terminology, which permeated the majority of these requirements.
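To make two of these issue types concrete, here is a toy check for missing imperatives and vague terminology. It is a minimal sketch for illustration only: the word lists and the `lint_requirement` function are hypothetical and do not represent how QVscribe analyzes requirements.

```python
import re

# Hypothetical word lists for illustration only; not QVscribe's actual rules.
VAGUE_TERMS = {"adequate", "appropriate", "efficient", "fast", "sufficient", "user-friendly"}
IMPERATIVES = {"shall", "must"}


def lint_requirement(text: str) -> list[str]:
    """Return a list of issue labels found in one requirement sentence."""
    # Lowercase words, keeping hyphenated words such as "user-friendly" intact.
    words = set(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
    issues = []
    if not IMPERATIVES & words:
        issues.append("missing imperative")
    vague = sorted(VAGUE_TERMS & words)
    if vague:
        issues.append("vague terms: " + ", ".join(vague))
    return issues


examples = [
    "The vehicle shall reach 100 km/h from rest in under 9 seconds.",
    "The braking system should be adequate and user-friendly.",
]
for req in examples:
    print(lint_requirement(req) or "no issues found", "-", req)
```

The first example passes both checks, while the second is flagged for lacking an imperative and for two vague terms. A real quality analysis covers far more than keyword matching, but the sketch shows why such wording makes a requirement hard to verify.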
In defense of ChatGPT, it is worth acknowledging that these are common errors prevalent throughout the requirement authoring industry, even occasionally made by experienced engineers, albeit with reduced frequency. Therefore, the model cannot be deemed entirely inept at requirement authoring; rather, it performs slightly below the caliber of a subpar engineer.
The Role of Fine-Tuned LLMs
It is a testament to the model’s processing capability that it can comprehend and execute such requests. However, the deficiency in the pre-training and training material becomes evident in the poor quality of the requirements. Recall that the model was trained on a large amount of general text, not specifically on well-written requirements. Writing requirements is something the model is capable of doing, but it is not its area of expertise.
It is reasonable to assume that ChatGPT could yield higher-quality requirements if provided with more detailed information in the initial prompt, such as specific guidance to “Follow INCOSE standards,” “Eliminate the use of vague terminology,” or “Ensure that requirements are atomic.”
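As a rough sketch of what such a detailed prompt might look like, the snippet below assembles the test prompt together with the guidance quoted above. The `build_prompt` function and its parameters are hypothetical names chosen for illustration, and the snippet only builds the prompt text; it makes no assumption about how the prompt is sent to a model.

```python
from typing import Sequence

# Guidance strings taken from the text above; names below are hypothetical.
GUIDANCE = (
    "Follow INCOSE standards.",
    "Eliminate the use of vague terminology.",
    "Ensure that requirements are atomic.",
)


def build_prompt(subject: str, count: int, guidance: Sequence[str] = GUIDANCE) -> str:
    """Assemble one prompt: the task first, then one bullet per guidance rule."""
    task = f"Compose {count} functional engineering requirements for {subject}."
    rules = "\n".join(f"- {rule}" for rule in guidance)
    return f"{task}\nWhen writing them:\n{rules}"


prompt = build_prompt("an automobile, such as a Honda Civic", 50)
print(prompt)
```

Keeping the guidance in one reusable list avoids retyping it for every request, although the full context still has to be sent to the model each time.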
Furthermore, most requirement authors could potentially guide ChatGPT toward producing quality requirements through an iterative process of issuing prompts, reviewing results, and issuing subsequent prompts.
However, such a prompt cycle drains valuable time and resources. Repeatedly inputting extensive context-relevant information into the model for each requirement generation instance proves to be highly inefficient. While the chat interface is well-suited for general inquiries and prompts, it is not optimized for crafting complex engineering documents. The cost of time is very likely to outweigh the benefits that the LLM would provide in this instance.
This is where organizations should recognize the potential of harnessing the capabilities of smaller, finely tuned LLMs. A finely tuned LLM ensures that the generated content adheres to organizational and industry standards without the need for an extensive prompt development cycle. The process of extracting requirements relevant to your specific objectives becomes significantly simplified and accelerated. A focused model could streamline the requirement development process, handling the majority of the authoring workload for your requirement engineers. Review and revision would remain integral to your workflow, but the authoring and review times could be substantially reduced.
Moreover, these benefits come at a reduced computational cost. Because a smaller LLM requires fewer computational resources to operate, it is a more cost-effective and strategic choice for companies. Although a subscription to ChatGPT may entail a lower initial cost than a customized LLM, the customized model’s return on investment is notably superior.
In the foreseeable future, a growing number of Software as a Service (SaaS) enterprises will commit to developing and optimizing these tailored LLMs, marking a growing trend in the technological landscape. While expansive, well-established models like ChatGPT will undoubtedly maintain their prominence and utility, early adopters will see the advantages of investing in an LLM that serves their goals. Capitalizing early and effectively is the key to harnessing the power of LLMs. Grammarly is already marketing a product that promises to add value to multiple sectors of an organization. It is merely a question of when, not if, other entities will embark on the creation of LLMs tailored to their respective industries.
Final Thoughts on LLMs
The future of LLMs lies in striking a balance. Understanding your company’s specific requirements and recognizing the limitations of certain LLMs is crucial for a successful and strategically sound implementation. While a large, general model like ChatGPT may suffice for some companies, those seeking quality, assurance, and alignment with their unique standards should venture beyond the standard offerings and explore the potential of smaller yet highly effective LLMs.