According to one estimate, OpenAI may have used more than a million hours of transcribed YouTube videos to train its latest artificial intelligence (AI) model, GPT-4, reportedly developing a tool called Whisper specifically for this purpose. The report suggests there were internal discussions at OpenAI about whether the practice violated YouTube’s terms of service, and that the ChatGPT creator turned to YouTube only after exhausting its supply of useful text data for training its AI models. If accurate, the claim could add to the headaches of the AI startup, which already faces several lawsuits over its use of copyrighted data. Notably, research published last month revealed that the GPT Store hosted chatbots that violated the company’s own policies.
OpenAI officially released Whisper
According to The New York Times, after running out of sources of fresh text to train its AI models, the company built Whisper, an automatic speech recognition tool, to transcribe YouTube videos and used the resulting text as training data. OpenAI officially released Whisper in September 2022, saying it was trained on 680,000 hours of “multilingual and multitask supervised data collected from the web.”
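For reference, the Whisper model OpenAI later released is available as an open-source Python package, and transcribing audio with it takes only a few lines. The sketch below assumes the publicly available openai-whisper package and uses a placeholder file name; it illustrates the released tool, not OpenAI’s internal data pipeline.

    # Minimal sketch using the open-source Whisper package (pip install openai-whisper).
    # "interview.mp3" is a placeholder file name, not taken from the report.
    import whisper

    model = whisper.load_model("base")          # small checkpoint; larger ones are more accurate but slower
    result = model.transcribe("interview.mp3")  # returns a dict with the full "text" plus timed segments
    print(result["text"])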
OpenAI staff questioned whether collecting YouTube data would violate the platform’s rules and expose the company to legal trouble; notably, Google restricts the use of YouTube videos in applications independent of the platform. OpenAI nevertheless carried out the plan, transcribing more than a million hours of YouTube videos and feeding the text into GPT-4. The NYT investigation further claims that OpenAI President Greg Brockman was closely involved and personally helped collect the videos.
OpenAI’s take on the matter
OpenAI has neither directly confirmed nor denied the allegations. A spokesperson said the company uses “unique” datasets for each model and obtains data from a variety of sources, including publicly available data and partnerships. YouTube’s terms of service, meanwhile, prohibit unauthorised scraping or downloading of its content.
Using content without creators’ permission raises copyright concerns, and training AI on potentially biased or misleading YouTube data raises ethical questions as well. The incident could lead to stricter data-usage policies for AI research and greater scrutiny of how large language models are trained. Collaboration between AI researchers and content platforms such as YouTube may be necessary to ensure ethical data practices.