Generative AI: A Threat to Copyright?

There has been much debate about ChatGPT worldwide, with concerns raised over copyright and over who owns the generated content.
At CamIDF, students, journalists, AI and digital law researchers discussed ChatGPT: Opportunities and Challenges. Kiripost/Siv Channa

ChatGPT, a generative artificial intelligence (AI) chatbot developed by OpenAI, provides many new opportunities, yet concerns have been raised over copyright issues as it seems unclear who owns the generated content: the users, tech companies or data owners?

ChatGPT generates search results based on the data available on the Internet and the data users input in the conversation. Generative AI uses these data sources to generate responses without the owners’ approval, potentially causing copyright issues.

Speaking via online presentation at a forum at Sunway Phnom Penh Hotel on Monday, Poren Chiang, a Taiwanese digital law researcher, alerted the public to the threat generative AI poses to copyright law.

“In their terms of use, many companies do not state that they have copyrights from data owners,” said Poren. “This is problematic.”

The forum, held under the theme “ChatGPT: Opportunities and Challenges,” was organized by CamIDF and drew about 100 students, journalists, AI and digital law researchers, academics and relevant stakeholders.

Poren quoted the well-known adage that there is no such thing as a free lunch, and said this holds true for generative AI. Instead of charging users for using their platforms, the companies sell users’ data.

Poren said, “For those who do not write, draw, and speak many languages, generative AI is really helpful. But, where the data is sourced matters.”

Generative AI has access to a vast amount of data, and combined with the efficiency of its powerful AI model, ChatGPT outperforms humans in many ways. The data is analyzed without users’ approval and sold to business owners via advertisements.

Poren said, “In order for Google, Bing, and Copilot to generate search results, they needed to use uploaded data on their platform, and they used that data to train the AI model.”

On Facebook, Instagram, and Google platforms, data scraping and data analysis are used to target ads at users. In 2021, Facebook earned $117.9 billion in revenue. In 2020, Alphabet, the parent company of Google, reported that about 80 percent of its total revenue, roughly $147 billion, came from advertisements.

Data scraping is not a new concept. Search engines and generative AI tools, such as Google, Bing, and Copilot, use whatever data is available on their platforms to train their models to generate search results.

However, there is no clear legal statement about who owns the search results generated by generative AI. That is why there are currently court battles taking place.

Despite the opportunities generative AI provides, there are legal issues. As the number of people using ChatGPT skyrocketed, data scraping became a legal issue because the scraped data might be illegal or unethical, for example explicit content, child pornography, stolen articles, stolen software code, and personal information.

The problem with generative AI lies in who owns the source data and how search results are generated.

“There are arguments that AI tries to copy from the original data. It is a kind of copy that does not generate new stuff, it just copies the original stuff and mixes it with others. There is a legal possibility for your AI generated content to be considered the copyright work of the original author,” Poren told forum participants.

In court, one case is compared with another to convince judges that the two are the same. If many people use AI-generated work and claim it as their own, it will be problematic because it implies that there is no copyright in the content.

For example, when you ask an AI chatbot to write an article about Cambodia, there is a possibility that it will copy from the Cambodia ICT and Digital Forum (CamIDF) and ODC websites and rephrase the text. CamIDF could then claim that the text the chatbot generated is copyrighted by its authors.

In some court cases, artists are arguing that the AI model generates its results using a decompression technique: an image is compressed into a set of mathematical parameters, and when it is decompressed the image remains almost the same.

In early 2023, Sarah Andersen, an American cartoonist, illustrator and creator of Sarah’s Scribbles, filed a lawsuit against Stability AI, the company behind Stable Diffusion, accusing it of using her artwork without permission.

She believes that art is deeply personal and that AI erases humanity from it. She asked what the intentions of AI users are.

In addition, copyright is important and should be protected because the rights were invented to boost innovation.

Besides copyright violations, there are also concerns about privacy violations and personal safety. However, some laws restrict the use of personal information, such as the California Consumer Privacy Act and the privacy law in Taiwan.

Under these laws, users can ask platforms to remove personal and sensitive information, such as addresses, family members' names, car information, and passwords, to mention a few.

Sadly, some information, such as a photo of one’s face, is hard to define and therefore cannot be removed. That is why, in 2021, Facebook shut down its powerful face recognition tool due to concerns regarding privacy violations, stalking, and surveillance.

Facebook’s decision to remove such a controversial function was celebrated by some people, who said it showed that the company takes user privacy seriously. Critics, in contrast, said the move was made to avoid regulation.

Poren warned people to be careful about what they input when using generative AI, because the model is trained on input from users, and users sometimes ask the chatbot very personal questions.

“There is a case where private information is leaked from the model; it was because the private information was mixed with the data used to generate search results,” he said.

Before the emergence of ChatGPT, Copilot worked in a similar way, using source code on GitHub to generate search results.

“Some Chinese programmers uploaded personal and sensitive information to GitHub,” Poren said. “The platform takes it as trained data; thus, Copilot generates those personal data.”