Researchers suggest OpenAI trained AI models on paywalled O’Reilly books


OpenAI has been accused by many parties of training its AI on copyrighted content without permission. Now, a new paper from an AI watchdog organization makes the serious accusation that the company has increasingly relied on non-public books it didn't license to train its more sophisticated AI models.

AI models are essentially complex prediction engines. Trained on a lot of data (books, movies, TV shows, and so on), they learn patterns and novel ways to extrapolate from them. When a model "writes" an essay on a Greek tragedy or "draws" an image in a familiar style, it's simply pulling from its vast knowledge to approximate. It isn't arriving at anything new.

While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have sworn off real-world data entirely. That's likely because training on purely synthetic data comes with risks, such as the degradation of a model's performance.

The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.)

In ChatGPT, GPT-4o is the default model. O'Reilly Media doesn't have a licensing agreement with OpenAI, the paper says.

"GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content … compared to OpenAI's earlier model GPT-3.5 Turbo," wrote the paper's co-authors. "In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples."

The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models' training data. Also known as a "membership inference attack," the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. If it can, it suggests that the model might have prior knowledge of the text from its training data.
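As a rough illustration of the idea (not the paper's actual implementation), a DE-COP-style test can be sketched as a multiple-choice quiz: the model is repeatedly shown the verbatim excerpt shuffled among paraphrases and asked to pick the original. Here `ask_model` is a hypothetical stand-in that guesses at random; with a real LLM, accuracy well above chance would hint at memorization:

```python
import random

random.seed(0)  # make the sketch reproducible

def ask_model(question: str, options: list[str]) -> int:
    """Hypothetical stand-in for a real LLM call; it just guesses at random.
    In a real DE-COP-style test, the model would be prompted to identify
    which option is the verbatim passage."""
    return random.randrange(len(options))

def membership_quiz(original: str, paraphrases: list[str], trials: int = 100) -> float:
    """Show the model the original excerpt shuffled among paraphrases and
    return the fraction of trials where it picks the original. Accuracy
    well above chance (1 / number of options) suggests the excerpt may
    have appeared in the model's training data."""
    correct = 0
    for _ in range(trials):
        options = [original] + paraphrases
        random.shuffle(options)
        guess = ask_model("Which option is the verbatim excerpt?", options)
        if options[guess] == original:
            correct += 1
    return correct / trials

# With a random-guessing stand-in and 3 paraphrases (4 options total),
# accuracy should hover near the 0.25 chance level.
rate = membership_quiz("the verbatim excerpt",
                       ["paraphrase one", "paraphrase two", "paraphrase three"],
                       trials=2000)
print(round(rate, 2))
```

A real attack would swap `ask_model` for an API call and compare accuracy on excerpts published before versus after the model's training cutoff.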

The paper's co-authors, O'Reilly, Strauss, and AI researcher Sruly Rosenblat, say that they probed GPT-4o's, GPT-3.5 Turbo's, and other OpenAI models' knowledge of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training dataset.

According to the paper's results, GPT-4o "recognized" far more paywalled O'Reilly book content than OpenAI's older models, specifically GPT-3.5 Turbo. That held even after accounting for potential confounding factors, the authors said, such as improvements in newer models' ability to figure out whether text was human-authored.
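Membership-inference studies of this kind typically summarize "recognition" as an AUROC-style score: the probability that a randomly chosen likely-in-training ("member") excerpt scores higher than a randomly chosen out-of-training one, where 0.5 is chance and 1.0 is perfect separation. A minimal pairwise computation (illustrative only, not the paper's code) might look like:

```python
def auroc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """Probability that a randomly chosen member-excerpt score exceeds a
    randomly chosen non-member score, counting ties as half a win.
    0.5 means the scores are indistinguishable; 1.0 is perfect separation."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical per-excerpt recognition scores: books published before the
# training cutoff (possible members) vs. after it (non-members).
before_cutoff = [0.9, 0.8, 0.7]
after_cutoff = [0.4, 0.5, 0.8]
print(round(auroc(before_cutoff, after_cutoff), 3))
```

For large score lists a rank-based formulation (as in `sklearn.metrics.roc_auc_score`) avoids the quadratic pairwise loop, but the result is the same.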

"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," the co-authors wrote.

It isn't a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn't foolproof, and that OpenAI might have collected the paywalled book excerpts from users copying and pasting them into ChatGPT.

Muddying the waters further, the co-authors didn't evaluate OpenAI's most recent collection of models, which includes GPT-4.5 and "reasoning" models such as o3-mini and o1. It's possible that these models weren't trained on paywalled O'Reilly book data, or were trained on a smaller amount of it than GPT-4o.

That said, it's no secret that OpenAI, which has advocated for looser restrictions around developing models with copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to hire journalists to help fine-tune its models' outputs. That's a trend across the broader industry: AI companies recruiting experts in fields like science and physics so that these experts can effectively feed their knowledge into AI systems.

It should also be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms, albeit imperfect ones, that allow copyright owners to flag content they'd prefer the company not use for training purposes.

Still, as OpenAI battles several lawsuits over its training data practices and its treatment of copyright law in U.S. courts, the O'Reilly paper isn't the most flattering look.

Openai did not respond to a request for comment.

