Earlier this week, The Wall Street Journal reported that AI businesses were struggling to collect high-quality training data.
More recently, The New York Times documented some of the ways businesses have coped with this. Unsurprisingly, it involves activities that fall into the murky grey area of AI copyright law.
The story begins with OpenAI, which, desperate for training data, reportedly built its Whisper audio transcription tool to overcome the challenge, transcribing more than a million hours of YouTube videos to train GPT-4, its most powerful large language model.
According to The New York Times, the company was aware of the legal issues but believed the practice was fair use. The Times reports that OpenAI president Greg Brockman was personally involved in collecting the videos used.
In an email to The Verge, OpenAI spokesperson Lindsay Held explained that the company curates "unique" datasets for each of its models to "help their understanding of the world" and maintain its global research competitiveness.
Held noted that the company uses "numerous sources, including publicly available data and partnerships for non-public data," and that it is considering creating its own synthetic data.
According to the Times piece, the company ran out of useful data in 2021 and, having exhausted other options, considered transcribing YouTube videos, podcasts, and audiobooks.
By then, it had trained its models on data such as GitHub computer code, chess move databases, and Quizlet homework content.