According to one estimate, OpenAI may have used more than a million hours of transcribed YouTube videos to train its latest artificial intelligence (AI) model, GPT-4, reportedly developing a tool called Whisper specifically for this purpose. The report suggests there were internal discussions at OpenAI about whether the practice violated YouTube’s terms of service, and that the ChatGPT creator turned to YouTube only after exhausting its supply of useful text data for training its AI models. If accurate, the claim could add to the headaches of the AI startup, which already faces several lawsuits over its use of copyrighted data. Notably, research published last month revealed that the GPT Store hosted chatbots that violated the company’s own policies.
OpenAI officially released Whisper
According to The New York Times, after running out of sources of fresh text to train its AI models, the company built Whisper, an automatic speech recognition tool, to transcribe YouTube videos and used the resulting text as training data. OpenAI officially released Whisper in September 2022, saying it was trained on 680,000 hours of “multilingual and multitask supervised data collected from the web.”
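For reference, the Whisper model OpenAI later released is available as an open-source Python package, and transcribing audio with it takes only a few lines. The sketch below assumes the publicly available openai-whisper package and uses a placeholder file name; it illustrates the released tool, not OpenAI’s internal data pipeline.

    # Minimal sketch using the open-source Whisper package (pip install openai-whisper).
    # "interview.mp3" is a placeholder file name, not taken from the report.
    import whisper

    model = whisper.load_model("base")          # small checkpoint; larger ones are more accurate but slower
    result = model.transcribe("interview.mp3")  # returns a dict with the full "text" plus timed segments
    print(result["text"])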
OpenAI staff questioned whether collecting YouTube data would violate the platform’s rules and expose the company to legal trouble; notably, Google restricts the use of YouTube videos in applications independent of the platform. OpenAI nevertheless carried out the plan, transcribing more than a million hours of YouTube videos and feeding the text into GPT-4. The NYT investigation further claims that OpenAI President Greg Brockman was closely involved and personally helped collect the videos.
OpenAI’s take on the matter
OpenAI has neither directly confirmed nor denied the allegations. A spokesperson said the company uses “unique” datasets for each model and obtains data from a variety of sources, including publicly available data and partnerships. YouTube’s terms of service, meanwhile, prohibit unauthorised scraping or downloading of its content.
Using content without creators’ permission raises copyright concerns, and training AI on potentially biased or misleading YouTube data raises ethical questions as well. The incident could lead to stricter data-usage policies for AI research and greater scrutiny of how large language models are trained. Collaboration between AI researchers and content platforms such as YouTube may be necessary to ensure ethical data practices.