Detecting sponsored content in YouTube Videos

I’m back!

With the whole lockdown situation, I had a little more time to explore a space I’m in love with: machine learning.

I feel like data management and intelligence are going to be at the heart of growth hacking/engineering, so I was looking for an opportunity to dabble in it.

Growth engineering and machine learning are fields that require very similar skills:

Mostly Python development
Scraping
Data processing
Analytical and critical thinking

After going through a few very practical tutorials on kaggle.com/learn, I was able to get my hands on my first few projects.

The audio route

The idea was suggested by a friend who is really anti-ads and was getting super annoyed with sponsored content in his podcasts.

He told me about SponsorBlock, which is a crowdsourced ad skipper for YouTube.

When you download their extension, you’ll see a highlighted portion corresponding to sponsored content:

If a video isn’t labeled yet, you can also label the sponsored content yourself so other users can skip it.

What’s really cool about SponsorBlock is that their database is completely open!

You get access to every single labeled video, with the start and end time of the sponsors.
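To give an idea of what that kind of data looks like, here is a minimal sketch using an in-memory SQLite table. The table name, column names and rows are made up for illustration; the real SponsorBlock export may be organised differently.

```python
import sqlite3

# Hypothetical, simplified schema: the real SponsorBlock export may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sponsor_segments (video_id TEXT, start_time REAL, end_time REAL)"
)
conn.executemany(
    "INSERT INTO sponsor_segments VALUES (?, ?, ?)",
    [
        ("video_a", 12.0, 75.5),    # made-up rows, not real SponsorBlock data
        ("video_a", 600.0, 660.0),
        ("video_b", 0.0, 30.0),
    ],
)


def segments_for(video_id):
    """Return the (start, end) sponsor segments of one video, in playback order."""
    rows = conn.execute(
        "SELECT start_time, end_time FROM sponsor_segments "
        "WHERE video_id = ? ORDER BY start_time",
        (video_id,),
    )
    return rows.fetchall()


print(segments_for("video_a"))  # [(12.0, 75.5), (600.0, 660.0)]
```

Each row is one labeled sponsor segment, so a video with several sponsors simply shows up several times.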

Their SQL database is open and updated often, so I used it to answer my next question:

Is it possible to detect sponsored content in YouTube videos?

I didn’t know where to start, but I knew I had two paths:

Train a model on the audio
Train a model on the transcripts

I felt like audio would carry much more information than plain text (music, cadence...), so I started down that road.

Without knowing anything about the subject, I searched around for audio classification algorithms and found my way to the Panotti repo, which is based on a CNN (convolutional neural network).

It was used to successfully detect 12 different guitar effects with 99.7% accuracy, making 11 mistakes on 4000 testing examples!

In my case I only needed two classes (sponsor and not sponsor).

I followed the instructions, learned how to use youtube-dl to download podcast highlights, and scraped Radiocentre’s commercials database because, at that point, I hadn’t learned about SponsorBlock yet.

The results were mixed:

When trained on a single podcast, the model showed 99% accuracy on the training set and 95% on the test set, which means it correctly classified 95% of the examples it hadn’t seen during training.
When fed different podcasts, the model wasn’t able to detect commercials reliably, outputting less than 60% confidence on its predictions.

I figured that I simply needed to train the model on multiple different podcasts. The problem was that the data was getting super tough to handle: the dataset for 300 podcasts was about 50 GB :(

So, in order to get more “information” out of lighter data, I finally went with the transcripts!

The captions route

In the meantime I learnt about SponsorBlock, which made everything easier.

I also looked around for NLP (natural language processing) tutorials and found this excellent repo about sentiment analysis.

Once I went through it, I decided to go for the Transformer model, which is based on BERT, an NLP model developed by Google.

To build the dataset, I still used youtube-dl, except this time I fetched the automatic captions when they were available, giving me about 80k examples (it took about 35 hours on a 100 Mb/s connection).
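As a rough idea of what fetching captions with youtube-dl can look like from Python, here is a small sketch. Skipping the video download, choosing English and the VTT format, and the `captions` output folder are all assumptions made for the example, not necessarily the settings used for this dataset.

```python
import subprocess


def build_caption_command(video_url, out_dir="captions"):
    """Build a youtube-dl command that only fetches auto-generated English captions."""
    return [
        "youtube-dl",
        "--skip-download",         # assumption: we only want the captions, not the video
        "--write-auto-sub",        # ask for the automatically generated captions
        "--sub-lang", "en",
        "--sub-format", "vtt",
        "-o", f"{out_dir}/%(id)s.%(ext)s",  # name the output files after the video id
        video_url,
    ]


def fetch_captions(video_url):
    """Run youtube-dl (must be installed); True means it exited without error."""
    result = subprocess.run(build_caption_command(video_url), capture_output=True)
    return result.returncode == 0


print(" ".join(build_caption_command("https://www.youtube.com/watch?v=MlOPPuNv4Ec")))
```

Videos without automatic captions just won't produce a subtitle file, so it's worth checking the output folder rather than relying on the exit code alone.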

To make things reliable, I only trained on ads longer than 10 seconds and shorter than 5 minutes, and I kept only the videos that had a single ad.
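Those two rules can be sketched in a few lines. The `Segment` class and the sample rows are hypothetical, and counting ads per video before applying the duration filter is my assumption about the order of the steps.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    video_id: str
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self):
        return self.end - self.start


def select_training_ads(segments, min_s=10.0, max_s=300.0):
    """Keep ads lasting between min_s and max_s, from videos with exactly one ad."""
    ads_per_video = Counter(s.video_id for s in segments)
    return [
        s for s in segments
        if ads_per_video[s.video_id] == 1 and min_s < s.duration < max_s
    ]


# Made-up sample: only video "a" satisfies both rules.
sample = [
    Segment("a", 30.0, 90.0),     # single 60 s ad: kept
    Segment("b", 10.0, 40.0),     # video "b" has two ads: dropped
    Segment("b", 300.0, 360.0),
    Segment("c", 0.0, 5.0),       # shorter than 10 s: dropped
    Segment("d", 0.0, 400.0),     # longer than 5 min: dropped
]
kept = select_training_ads(sample)
print([s.video_id for s in kept])
```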

To keep the dataset balanced, I took equal parts sponsor and content from each video (if the ad was 3 minutes long, the content training example was also 3 minutes long).
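Here is one way the "equal parts" idea could be sketched. Captions are assumed to be a list of `(start_time, text)` cues, and picking the content window right after the ad (or right before it when the video ends too soon) is an assumption for the example; any other part of the video of the same length would do.

```python
def text_between(cues, start, end):
    """Join the text of caption cues whose start time falls in [start, end)."""
    return " ".join(text for t, text in cues if start <= t < end)


def balanced_pair(cues, ad_start, ad_end, video_length):
    """Return [(text, label), ...]: the ad (label 1) and same-length content (label 0)."""
    ad_len = ad_end - ad_start
    if ad_end + ad_len <= video_length:
        c_start, c_end = ad_end, ad_end + ad_len       # window right after the ad
    elif ad_start - ad_len >= 0:
        c_start, c_end = ad_start - ad_len, ad_start   # window right before the ad
    else:
        return []  # video too short to supply a content window of equal length
    return [
        (text_between(cues, ad_start, ad_end), 1),
        (text_between(cues, c_start, c_end), 0),
    ]


# Made-up captions for a 60 s video with an ad between 10 s and 20 s.
cues = [
    (0.0, "welcome back"),
    (10.0, "this video is sponsored by"),
    (15.0, "use the code"),
    (20.0, "now the review"),
    (25.0, "the screen is great"),
]
for text, label in balanced_pair(cues, 10.0, 20.0, 60.0):
    print(label, text)
```

Every video then contributes one sponsor example and one content example of the same duration, which keeps the two classes at a 50/50 split.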

You can find the dataset right here: https://www.kaggle.com/anaselmhamdi/sponsor-block-subtitles-80k

The model reached 93.79% test accuracy!

You can find the notebook I ran over here: https://www.kaggle.com/anaselmhamdi/transformers-sponsorblock/ (it’s really raw and undocumented though).

Working with the model

The next step was to automatically label a random video and see what would come out of it.

So here’s what popped up in my recommendations: https://www.youtube.com/watch?v=MlOPPuNv4Ec, a video by Linus Tech Tips.

I took my machine for a spin and you can find the raw results here or on this spreadsheet.

It correctly detected the two sponsored segments (the Ting segments), but it also labelled some regular content as sponsored with a high confidence level.

A few remarks about the mistakes:

The mislabelled segments all contain uppercase letters. I left the casing in because I thought capitals would indicate a brand, but in the captions, names, towns and such appear more frequently.
Some of them also contained brackets, which probably biases the process since [Music] and [Applause] are often in sponsored content.
A conversation about MacBooks was flagged as sponsored content.
I chunked the captions very roughly into 10-second parts for the inference on the labelled video. I feel like I need a finer set of chunks, and to cross-check the results between them, to get better results.
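A minimal sketch of what these fixes could look like: lowercasing the text and stripping bracketed tags, then scoring overlapping windows and averaging the scores per caption. The 5-second step, the averaging rule and the `fake_score` function are assumptions for illustration; `fake_score` is only a stand-in for the real model.

```python
import re


def clean_caption(text):
    """Lowercase the text and drop bracketed tags such as [Music] or [Applause]."""
    without_tags = re.sub(r"\[[^\]]*\]", " ", text)
    return " ".join(without_tags.lower().split())


def sliding_windows(cues, window_s=10.0, step_s=5.0):
    """Return (start, end, text) windows; they overlap when step_s < window_s."""
    windows = []
    if not cues:
        return windows
    last = max(t for t, _ in cues)
    start = 0.0
    while start <= last:
        end = start + window_s
        text = " ".join(clean_caption(c) for t, c in cues if start <= t < end)
        windows.append((start, end, " ".join(text.split())))
        start += step_s
    return windows


def score_per_cue(cues, windows, scores):
    """Average, for each cue, the scores of every window that covers it."""
    averaged = []
    for t, _ in cues:
        hits = [s for (start, end, _), s in zip(windows, scores) if start <= t < end]
        averaged.append(sum(hits) / len(hits) if hits else 0.0)
    return averaged


def fake_score(text):
    """Placeholder for the trained model: 1.0 if the window mentions a sponsor."""
    return 1.0 if "sponsored" in text else 0.0


cues = [(0.0, "[Music] Hello"), (6.0, "Sponsored by Ting"), (12.0, "Back to MacBooks")]
windows = sliding_windows(cues)
scores = [fake_score(text) for _, _, text in windows]
print(score_per_cue(cues, windows, scores))
```

With overlapping windows, a caption sitting on the edge of an ad gets an in-between score instead of being forced into whichever 10-second chunk it happened to land in.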

I feel like there’s a huge margin for improvement, both from fixing these few issues and from broader changes such as:

Extending the dataset to more videos (I only had a subset of them)
Revisiting the BERT model parameters and fine-tuning the neural network
Implementing feedback from experienced data scientists, which I’m absolutely not

It really says something about how far the field has advanced that a complete newbie like me could build something somewhat functional.

All the data and the work I found were really well documented and easy to pick up.

The toughest part was actually deploying the model as a service, since I’m not used to systems that need a lot of resources.

I actually deployed the API but it’s not stable. I’ll update the post when it’s better!


See ya next post,

Anas
