In-Short
- Study by the AI Disclosures Project suggests OpenAI’s GPT-4o may have been trained on copyrighted O’Reilly Media books.
- GPT-4o scored an AUROC of 82% at recognizing paywalled content, while GPT-3.5 Turbo and GPT-4o Mini showed less recognition.
- Concerns raised over potential decline in internet content quality due to uncompensated use of training data.
- Emerging market for legally obtained training data could address issues of data provenance and remuneration.
Summary of the Study on OpenAI’s Training Data
A recent investigation by the AI Disclosures Project has brought to light concerns regarding the training data used by OpenAI for its language models. The study focused on the GPT-4o model’s ability to recognize content from copyrighted books by O’Reilly Media, suggesting that OpenAI might be using such data without proper authorization.
The research, spearheaded by Tim O’Reilly and Ilan Strauss, applied the DE-COP membership inference attack to test whether OpenAI’s models could tell original book passages apart from paraphrased versions of them. GPT-4o showed a strong ability to recognize paywalled content, with an Area Under the Receiver Operating Characteristic curve (AUROC) of 82%, where 50% corresponds to chance. Because this content is not freely available, the result points to possible unauthorized access, for example through shadow libraries such as LibGen.
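To make the AUROC figure concrete, here is a minimal sketch of how such a score can be computed. It assumes, roughly in the spirit of DE-COP, that each book receives a score equal to the fraction of multiple-choice questions in which the model picked the verbatim passage over paraphrases. The function name `auroc` and all numbers below are hypothetical illustrations, not data or code from the study.

```python
def auroc(member_scores, non_member_scores):
    """Probability that a randomly chosen 'member' item scores higher than
    a randomly chosen 'non-member' item, with ties counted as half a win."""
    if not member_scores or not non_member_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical per-book scores: fraction of quiz questions where the model
# picked the verbatim passage. These values are made up for illustration.
suspected_in_training = [0.5, 0.75, 0.25, 0.5]   # e.g. books published before the cutoff
published_after_cutoff = [0.25, 0.5, 0.25, 0.0]  # books the model cannot have seen

print(auroc(suspected_in_training, published_after_cutoff))  # 0.8125
```

An AUROC of 0.5 means the scores do not separate the two groups at all, and 1.0 means they separate perfectly, so a value near 0.8 indicates that the model recognizes the suspected books noticeably more often than the control books.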
While GPT-4o showed high recognition of non-public content, its predecessor, GPT-3.5 Turbo, and a smaller model, GPT-4o Mini, did not demonstrate the same level of recognition. This discrepancy raises questions about the sources of training data and the ethical implications of using copyrighted material without compensation.
The report underscores the broader issue of copyright infringement in AI training processes and the potential negative impact on the diversity and quality of internet content. It advocates for increased corporate transparency and the establishment of commercial markets for training data licensing.
With the EU AI Act’s disclosure requirements on the horizon, there is hope for a more accountable AI industry. Meanwhile, companies like Defined.ai are leading the way in creating a market for legally sourced training data, ensuring consent and privacy are respected.
In conclusion, the study presents evidence that OpenAI’s GPT-4o was likely trained on proprietary O’Reilly Media books, highlighting the need for more stringent data usage policies in AI development.