If humans make mistakes and build software with errors and therefore generate vulnerabilities, why would artificial intelligence not make mistakes if it was trained by humans?
Authors: Raphael Khoury – Anderson R. Avila – Jacob Brunelle – Baba Mamadou Camara
Abstract—In recent years, large language models have been responsible for great advances in the field of artificial intelligence (AI). ChatGPT in particular, an AI chatbot developed and recently released by OpenAI, has taken the field to the next level.
The conversational model is able not only to process human-like text, but also to translate natural language into code. However, the safety of programs generated by ChatGPT should not be overlooked. In this paper, we perform an experiment to address this issue. Specifically, we ask ChatGPT to generate a number of programs and evaluate the security of the resulting source code. We further investigate whether ChatGPT can be prodded to improve the security by appropriate prompts, and discuss the ethical aspects of using AI to generate code. Results suggest that ChatGPT is aware of potential vulnerabilities, but nonetheless often generates source code that is not robust to certain attacks.
Index Terms—Large language models, ChatGPT, code security, automatic code generation.
I. INTRODUCTION
For years, large language models (LLMs) have demonstrated impressive performance on a number of natural language processing (NLP) tasks, such as sentiment analysis, natural language understanding (NLU), and machine translation (MT), to name a few. This has been made possible chiefly by increasing the model size, the amount of training data, and the model complexity. In 2020, for instance, OpenAI announced GPT-3, a new LLM with 175B parameters, roughly 100 times larger than GPT-2. Two years later, ChatGPT, an artificial intelligence (AI) chatbot capable of understanding and generating human-like text, was released. The conversational AI model, powered at its core by an LLM based on the Transformer architecture, has received great attention from both industry and academia, given its potential to be applied in different downstream tasks (e.g., medical reports, code generation, and educational tools).
Besides multi-turn question answering (Q&A) conversations, ChatGPT can translate human-like text into source code. The model has the potential to incorporate most of the early Machine Learning (ML) coding applications, e.g., bug detection and localization, program synthesis, code summarization, and code completion. This makes the model very attractive to software development companies that aim to increase productivity while minimizing costs.
It can also benefit new developers who need to speed up their development process or more senior programmers who wish to lighten their daily workload. However, the risk of developing and deploying source code generated by ChatGPT is still unknown.
Therefore, this paper attempts to answer the question of how secure the source code generated by ChatGPT is.
Moreover, we investigate and propose follow-up questions that can guide ChatGPT to assess and regenerate more secure source code.
In this paper, we perform an experiment to evaluate the security of code generated by ChatGPT, fine-tuned from a model in the GPT-3.5 series. Specifically, we asked ChatGPT to generate 21 programs in 5 different programming languages: C, C++, Python, HTML, and Java. We then evaluated the generated programs and questioned ChatGPT about any vulnerability present in the code. The results were worrisome. We found that, in several cases, the code generated by ChatGPT fell well below minimal security standards applicable in most contexts. In fact, when asked whether or not the produced code was secure, ChatGPT was able to recognize that it was not. The chatbot, however, was able to provide a more secure version of the code in many cases if explicitly asked to do so.
The remainder of this paper is organized as follows. Section II describes our methodology and provides an overview of the dataset. Section III details the security flaws we found in each program. In Section IV, we discuss our results, as well as the ethical considerations of using AI models to generate code. Section V discusses threats to the validity of our results. Section VI surveys related work. Concluding remarks are given in Section VII.
II. STUDY SETUP
In this study, we asked ChatGPT to generate 21 programs, using a variety of programming languages. The generated programs serve a variety of purposes, and each program was chosen to highlight the risk of a specific vulnerability (e.g., SQL injection in the case of a program that interacts with a database, or memory corruption for a C program). In some cases, our instructions to the chatbot specified that the code would be used in a security-sensitive context. However, we elected not to specifically instruct ChatGPT to produce secure code, or to incorporate specific security features such as input sanitization. Our experiment thus simulates the behavior of a novice programmer who asks the chatbot to produce code on their behalf, and who may be unaware of the minutiae required to make code secure.
We then prodded ChatGPT about the security of the code it produced. Whenever a vulnerability was evident, we created
a) Program 1: is a simple C++ FTP server to share files located in a public folder. The code generated by ChatGPT performs no input sanitization whatsoever, and is trivially vulnerable to path traversal.
When prompted about the behavior of the program on a malicious input, ChatGPT readily realized that the program is vulnerable to path traversal, and was even able to provide a cogent explanation of the steps needed to secure the program.
However, when asked to produce a more secure version of the program, ChatGPT merely added two sanitization checks to the code: a first check to ensure that the user input only contains alphanumeric characters, and a second check to ensure that the path of the shared file contains the path of the shared folder. Both checks are relatively simple and can be circumvented by even a novice adversary.
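To illustrate why a check that the file path merely contains the shared folder's path is insufficient, the following minimal Python sketch contrasts it with a check performed on the canonicalized path. It is not the C++ code produced by ChatGPT; the function names and the folder `/srv/public` are hypothetical and chosen only for illustration.

```python
import os


def naive_check(shared_root: str, requested: str) -> bool:
    # Hypothetical analogue of the containment test described above:
    # the joined path only has to contain the shared folder's path.
    return shared_root in os.path.join(shared_root, requested)


def canonical_check(shared_root: str, requested: str) -> bool:
    # Resolve "..", symlinks, and absolute paths first, then verify
    # that the result still lies inside the shared folder.
    root = os.path.realpath(shared_root)
    candidate = os.path.realpath(os.path.join(root, requested))
    return os.path.commonpath([root, candidate]) == root


# "/srv/public" is an assumed location for the shared folder.
attack = "../../etc/passwd"
naive_result = naive_check("/srv/public", attack)
canonical_result = canonical_check("/srv/public", attack)
```

Under these assumptions, the naive test accepts the traversal string because the unresolved path still begins with the shared folder, whereas the canonical test rejects it. A production server would likely need further safeguards (e.g., against races between the check and the file access).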
b) Program 2: is a C++ program that receives as input an email address, and passes it to a program (as a parameter) through a shell. As discussed by Viega et al., handling input in this manner allows a malicious adversary to execute arbitrary code by appending shell instructions to a fictitious email.
As was the case in the previous example, when asked about the behavior of the program on a malicious input, ChatGPT realizes that the code is vulnerable. In this case, the behavior is only triggered by a crafted input, so only a user who is already aware of the security risk would ever ask about this situation. However, ChatGPT is then able to provide an explanation as to why the program is vulnerable and create a more secure program. The corrected program exhibits some input validation tests, but they are fairly limited and the program remains vulnerable, a situation that is hard to avoid considering how risky it is to feed user input directly to the command line. Creating a truly secure program would probably require a more fundamental modification of the code, which is beyond the capabilities of a chatbot tasked with responding to user requests. This use case raises interesting ethical issues, since it may be argued that the instructions given to ChatGPT (i.e., passing the user's input to the program as a parameter) are inherently unsafe. We will return to this issue in Section IV.
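One example of such a more fundamental modification is to avoid the shell altogether and hand the input to the external program as a single argument. The Python sketch below is an assumption-laden illustration, not the code from our experiment: the external program is replaced by a hypothetical stand-in (`ECHO_PROGRAM`) that simply prints the argument it receives.

```python
import subprocess
import sys

# Hypothetical stand-in for the external program: a child Python
# process that echoes back the single argument it is given.
ECHO_PROGRAM = [sys.executable, "-c", "import sys; print(sys.argv[1])"]


def run_with_email(email: str) -> str:
    # An argument list with the default shell=False means no shell
    # parses the input, so ";" or "&&" are not command separators.
    result = subprocess.run(
        ECHO_PROGRAM + [email],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


crafted = "a@example.com; echo INJECTED"
echoed = run_with_email(crafted)
```

Here the crafted input reaches the child process verbatim as data instead of being interpreted as a second command. Input validation may still be needed, for instance to prevent a value beginning with `-` from being read as an option by the called program.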
c) Program 3: is a Python program that receives a user input and stores it in an SQL database. The program performs no input sanitization, and is trivially vulnerable to SQL injection. However, when asked about the behavior of the program on a textbook SQL injection entry, ChatGPT identified the vulnerability, and proposed a new version of the code that uses prepared statements to perform the database update securely.
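The prepared-statement approach can be sketched as follows. This is a minimal illustration rather than the code ChatGPT produced: it assumes Python's built-in `sqlite3` module and a hypothetical table named `entries`.

```python
import sqlite3


def store_input(conn: sqlite3.Connection, user_input: str) -> None:
    # The "?" placeholder passes the value separately from the SQL
    # text, so quotes or semicolons in it are stored as plain data.
    conn.execute("INSERT INTO entries (content) VALUES (?)", (user_input,))
    conn.commit()


# Hypothetical in-memory database and schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, content TEXT)")

# A textbook injection attempt is stored verbatim, not executed.
store_input(conn, "x'); DROP TABLE entries; --")
rows = conn.execute("SELECT content FROM entries").fetchall()
```

Because the query text is fixed and the input is bound as a parameter, the table survives and contains exactly one row holding the attacker's string.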
d) Program 4: is a C++ program that receives as input a user-supplied username and password, and checks that the username is not contained in the password using a regex. This process exposes the host system to a denial of service by way of a ReDoS attack if an adversary submits a crafted input that requires exponential time to process.
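One way to sidestep this class of attack is to perform the containment test without a regex engine. The Python sketch below assumes that the risk arises from treating the user-supplied username as a pattern; the function name and the case-insensitive comparison are our own illustrative choices, not details of the generated program.

```python
def password_contains_username(username: str, password: str) -> bool:
    # Plain substring search: the username is never interpreted as a
    # pattern, so no backtracking regex engine is involved.
    return username.casefold() in password.casefold()


# A pattern-like username is compared literally and matches nothing.
pattern_like = "(a+)+$"
long_password = "a" * 30 + "!"
contains_pattern = password_contains_username(pattern_like, long_password)
contains_name = password_contains_username("alice", "xxAlice123")
```

If a regex is nonetheless required, escaping the user-supplied portion (e.g., with `re.escape` in Python) before building the pattern is another commonly used option.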