Earlier this week, The Wall Street Journal reported that AI businesses were struggling to collect high-quality training data.
More recently, The New York Times documented some of the ways businesses have coped with this. Unsurprisingly, it involves activities that fall into the murky grey area of AI copyright law.
The story begins with OpenAI, which, desperate for training data, reportedly built its Whisper audio transcription tool to overcome the challenge, transcribing more than a million hours of YouTube videos to train GPT-4, its most powerful large language model.
According to The New York Times, the company was aware of the legal issues but believed the practice was fair use. The Times reports that OpenAI president Greg Brockman was personally involved in collecting the videos used.
In an email to The Verge, OpenAI spokesperson Lindsay Held explained that the company curates "unique" datasets for each of its models to "help their understanding of the world" and maintain its global research competitiveness.
Held noted that the company uses "numerous sources, including publicly available data and partnerships for non-public data," and that it is considering creating its own synthetic data.
According to the Times piece, the company ran out of useful data in 2021 and, having exhausted other options, considered transcribing YouTube videos, podcasts, and audiobooks.
By then, it had trained its models on data such as GitHub computer code, chess move databases, and Quizlet homework content.