{"id":94597,"date":"2023-08-21T13:01:53","date_gmt":"2023-08-21T13:01:53","guid":{"rendered":"https:\/\/www.techopedia.com"},"modified":"2023-10-31T09:47:49","modified_gmt":"2023-10-31T09:47:49","slug":"insights-breaking-down-the-transformative-journey-of-gpt-models-in-ai-from-gpt-1-to-gpt-4","status":"publish","type":"post","link":"https:\/\/www.techopedia.com\/gpt-series-evolution-insights","title":{"rendered":"Insights: Breaking Down the Transformative Journey of GPT Models in AI, from GPT-1 to GPT-4"},"content":{"rendered":"
Artificial intelligence (AI) has seen major changes since the Generative Pre-trained Transformer (GPT) series started in 2018.

Successive models brought enhancements, upgrades, and challenges, capturing the interest of enthusiasts, researchers, and users. From GPT-1's basic text creation to GPT-4's diverse skills, the progress is evident. Ongoing studies examine these models' behavior, shedding light on their changing capabilities and possible issues.

This article covers the growth and study of the generative pre-trained transformer models, centering on their performance scores and insights from different tests.

## The Evolution of the Generative Pre-Trained Transformer Series

An essential aspect of understanding the advancements in the GPT series is the training computation, often gauged in total FLOP (floating-point operations). A FLOP represents a basic math operation, such as addition, subtraction, multiplication, or division, performed on two decimal numbers.

When it comes to scale, one petaFLOP equals a staggering quadrillion (10^15) FLOP. This measure of computation showcases the vast resources invested in training these models.
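To make these units concrete, here is a minimal Python sketch that converts the per-model training-compute figures quoted later in this article between petaFLOP and raw FLOP. The figures come from the article itself; the variable names are illustrative only.

```python
# Convert the training-compute figures quoted in this article
# between petaFLOP and raw FLOP. 1 petaFLOP = 10**15 FLOP.

PETA = 10**15  # one quadrillion floating-point operations

# Training compute per model, in petaFLOP (figures from the article).
TRAINING_COMPUTE_PETAFLOP = {
    "GPT-1": 17_600,
    "GPT-2": 1.49e6,   # 1.49 million petaFLOP
    "GPT-3": 314e6,    # 314 million petaFLOP
    "GPT-4": 21e9,     # 21 billion petaFLOP
}

for model, petaflop in TRAINING_COMPUTE_PETAFLOP.items():
    total_flop = petaflop * PETA
    print(f"{model}: {petaflop:,.0f} petaFLOP = {total_flop:.3g} FLOP")
```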
### Launch of GPT in 2018

GPT-1, introduced in June 2018, marked the inception of the generative pre-trained transformer model series and laid the groundwork for the ChatGPT of today. It showcased the potential of unsupervised learning in language understanding, predicting the next word in sentences using books as training data.

GPT-1 was trained using 17,600 petaFLOPs.
### The leap to GPT-2 in 2019

In February 2019, GPT-2 emerged as a significant upgrade to the series. It exhibited substantial improvements in text generation, producing coherent, multi-paragraph content. However, due to potential misuse concerns, GPT-2's public release was initially withheld; it was eventually launched in November 2019 after OpenAI's careful risk assessment.

GPT-2 was trained using 1.49 million petaFLOPs.
### The revolutionary GPT-3 in 2020

GPT-3, released in June 2020, was a monumental leap for the series. Its advanced text generation found applications in email drafting, article writing, poetry creation, and even programming code generation. It also demonstrated capabilities in answering factual queries and translating between languages.

GPT-3 was trained using 314 million petaFLOPs.
### GPT-3.5's Impact

GPT-3.5, released in 2022, is an improved version of GPT-3. It has fewer parameters and uses fine-tuning for better machine learning (ML), incorporating reinforcement learning from human feedback to make its outputs more accurate and effective. GPT-3.5 is also designed to follow ethical guidelines, helping ensure that the AI it powers is safe and reliable for humans to use.

This model is offered for free use by OpenAI. The number of petaFLOPs used for training is not available.
### Introduction of the multimodal GPT-4 in 2023

GPT-4, the most recent version, carries forward the trend of remarkable advancement, introducing enhancements such as multimodal capabilities that let the model accept image as well as text input.

This model is offered to ChatGPT Plus subscribers.

GPT-4 was trained using 21 billion petaFLOPs.
## GPT-3.5 vs. GPT-4: A Research Study

A research paper from Stanford University and the University of California, Berkeley, highlighted the shifts in GPT-4 and GPT-3.5's outputs as time progressed. The paper suggests that there has been an overall decline in the performance of these generative pre-trained transformer models.

Lingjiao Chen, Matei Zaharia, and James Zou studied OpenAI's models, using API access to examine the March and June 2023 versions. They conducted tests to understand the models' evolution and adaptability over time.
### Prime vs. Composite Numbers

The researchers wanted to check whether GPT-4 and GPT-3.5 can tell whether numbers are prime or composite. They used 1,000 questions for this test: half were prime numbers from a list extracted from another paper, and the other half were picked from the numbers between 1,000 and 20,000.

A method called Chain-of-Thought (CoT) prompting was used to guide the models' reasoning. This method breaks the task down: first, check whether the number is even; second, find its square root; and third, see whether any smaller prime numbers divide it.
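As a reference for what this CoT decomposition asks the model to do, here is a minimal Python sketch of the same three-step check: an even test, a square-root bound, and trial division by the remaining smaller candidates. This is an illustration of the reasoning procedure, not code from the study.

```python
import math

def is_prime(n: int) -> bool:
    """Mirror the CoT steps: even check, square-root bound, trial division."""
    if n < 2:
        return False
    if n == 2:
        return True
    # Step 1: any even number greater than 2 is composite.
    if n % 2 == 0:
        return False
    # Step 2: divisors come in pairs around sqrt(n), so testing up to
    # floor(sqrt(n)) is sufficient.
    limit = math.isqrt(n)
    # Step 3: trial-divide by the remaining odd candidates up to the bound
    # (this covers every smaller prime).
    for d in range(3, limit + 1, 2):
        if n % d == 0:
            return False
    return True

print(is_prime(19997))  # -> True; an example query in the tested range
```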
These were the results:

**GPT-4:**

**GPT-3.5:**
### Happy Numbers

This test aimed to check how well ChatGPT can identify happy numbers within a set range. A happy number is one that eventually reaches 1 when you repeatedly replace it with the sum of the squares of its digits.

For example, 13 is a happy number: 1 squared plus 3 squared equals 10, and then 1 squared plus 0 squared equals 1.

The study focused on this task because it has a single clear-cut numeric answer, unlike questions that can be guessed with a yes or no, and it involves only simple math.

For this test, 500 questions were created. Each asked how many happy numbers fall within a certain range. The size of the range varied, and its starting point was picked from the numbers between 500 and 15,000. The test again used CoT prompting to help with logical reasoning.
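For readers who want to verify the definition, here is a minimal Python sketch of a happy-number check, including the cycle detection needed for unhappy numbers. It is illustrative, not code from the study.

```python
def is_happy(n: int) -> bool:
    """Repeatedly replace n with the sum of the squares of its digits.

    Happy numbers reach 1; unhappy numbers fall into a repeating cycle,
    which we detect by remembering values already seen.
    """
    seen = set()
    while n != 1 and n not in seen:
        seen.add(n)
        n = sum(int(d) ** 2 for d in str(n))
    return n == 1

print(is_happy(13))  # True: 1^2 + 3^2 = 10, then 1^2 + 0^2 = 1
print(is_happy(4))   # False: 4 enters the cycle 16, 37, 58, 89, 145, 42, 20, 4
```

Counting the happy numbers in a range, as the test questions ask, is then a one-liner such as `sum(is_happy(n) for n in range(start, end + 1))`.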
These were the results:

**GPT-4:**

**GPT-3.5:**
### Sensitive/Dangerous Questions

This test looked at how the generative pre-trained transformer models handled sensitive questions. A set of 100 questions that could be harmful or controversial was created for it; models should avoid answering such questions directly.

The researchers used manual labeling to determine whether a model answered a question directly.
These were the results:

**GPT-4:**

**GPT-3.5:**
### Opinion Surveys

This test examined how the language models' opinion biases changed over time, using the OpinionQA dataset. This set contains 1,506 opinion questions drawn from top public polls. The questions were posed in multiple-choice style, and the models were told to "Pick the best single option."

The main goal was to see whether the models were willing to give opinions.
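As an illustration of the multiple-choice setup described above, a survey question might be rendered into a prompt along these lines. The formatting is a plausible sketch, not the researchers' exact template, and the sample question is invented for illustration.

```python
def build_opinion_prompt(question: str, options: list[str]) -> str:
    """Assemble a multiple-choice prompt ending with the study's instruction."""
    lines = [question]
    # Label each option A, B, C, ... so the model can answer with one letter.
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    lines.append("Pick the best single option.")
    return "\n".join(lines)

print(build_opinion_prompt(
    "How much, if at all, do you worry about climate change?",
    ["A great deal", "Some", "Not much", "Not at all", "Refused"],
))
```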
These were the results:

**GPT-4:**

**GPT-3.5:**
### Multi-hop Knowledge-intensive Questions

To study how well large language models (LLMs) can answer complex multi-hop questions, the researchers used an approach called the LangChain HotpotQA Agent. This approach has an LLM search through Wikipedia to find answers to intricate questions.

The agent was then assigned the task of responding to each query in the HotpotQA dataset.
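Since this article does not give the exact agent configuration, here is a schematic Python sketch of the loop such an agent typically runs: the model alternates between reasoning and Wikipedia lookups until it has gathered enough evidence to answer. `ask_llm` and `search_wikipedia` are hypothetical stubs, not real LangChain or OpenAI functions.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str  # "search" or "final_answer"
    text: str  # a search query, or the answer text

def ask_llm(question: str, evidence: list[str], force_answer: bool = False) -> Step:
    """Hypothetical LLM call: decides whether to search again or answer."""
    # A real agent would call the GPT API here with the evidence so far.
    return Step("final_answer", "stub answer")

def search_wikipedia(query: str) -> str:
    """Hypothetical retrieval tool: returns a relevant passage."""
    # A real agent would hit the Wikipedia search API here.
    return "stub passage"

def answer_multi_hop(question: str, max_hops: int = 4) -> str:
    """Multi-hop QA loop: alternate LLM reasoning with Wikipedia retrieval."""
    evidence: list[str] = []
    for _ in range(max_hops):
        step = ask_llm(question, evidence)
        if step.kind == "final_answer":
            return step.text  # the model has enough evidence to answer
        evidence.append(search_wikipedia(step.text))  # one more hop
    return ask_llm(question, evidence, force_answer=True).text

print(answer_multi_hop("Which magazine was started first?"))  # HotpotQA-style query
```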
These were the results:

**GPT-4:**

**GPT-3.5:**
### Generating Code

To assess the code generation capabilities of LLMs without the risk of data contamination, a new dataset was curated from the latest 50 problems categorized as "easy" on LeetCode, whose solutions and discussions were made public in December 2022.

The generative pre-trained transformer models were presented with these problems, along with the original descriptions and Python code templates.

The code generated by the LLMs was submitted directly to the LeetCode online judge for assessment. Acceptance by the judge signified that the code was valid Python and passed the judge's designated tests.
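Part of what the judge checks can be reproduced locally: the sketch below compiles a model-generated snippet to confirm it is valid Python, then runs it against one sample case. The generated function shown is a stand-in for illustration, not output from the study.

```python
# Minimal local check of model-generated code before judge submission:
# (1) does it compile as Python, (2) does it pass a sample test case.

generated_code = """
def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
"""

try:
    compiled = compile(generated_code, "<llm-output>", "exec")  # syntax check
except SyntaxError as e:
    print(f"Rejected: not valid Python ({e})")
else:
    namespace: dict = {}
    exec(compiled, namespace)  # define the generated function
    result = namespace["two_sum"]([2, 7, 11, 15], 9)
    print("Sample test passed" if result == [0, 1] else "Wrong answer")
```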
These were the results:

**GPT-4:**

**GPT-3.5:**
### Medical Exam

This test set out to evaluate the progress of GPT-4 and GPT-3.5 in a specialized field: the United States Medical Licensing Examination (USMLE).