February 6, 2021

Detecting sponsored content in YouTube videos

adblock

I’m back!

With the whole lockdown situation, I had a little more time to explore a space I’m in love with: Machine Learning.

I feel like data management and intelligence are going to be at the heart of growth hacking/engineering, so I was looking for an opportunity to dabble in it.

Growth engineering and machine learning are fields that require very similar skills.

After going through a few very practical tutorials on kaggle.com/learn, I was able to get my hands dirty with my first few projects.

The audio route

The idea was suggested by a friend who is really anti-ads and was getting super annoyed with sponsored content in his podcasts.

He told me about SponsorBlock, which is a crowdsourced ad skipper for YouTube.

When you download their extension, you’ll see a highlighted portion corresponding to sponsored content:

labelled video

You can also label content yourself for other users to skip if the video isn’t already labeled.

What’s really cool about SponsorBlock is that their database is completely open!

You get access to every single labeled video, with the start and end times of the sponsors.
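For the curious, pulling those labels out of a CSV export of the database looks roughly like this. I’m writing the column names (videoID, startTime, endTime, votes) from memory, so treat them as an assumption and double-check them against the dump you download:

```python
import pandas as pd

# Load a CSV export of SponsorBlock's sponsorTimes table
# (column names are assumptions -- verify them against a fresh export)
sponsors = pd.read_csv("sponsorTimes.csv")

# Keep reasonably trusted segments only (positive community vote score)
sponsors = sponsors[sponsors["votes"] > 0]

# One row per labeled segment: the video ID and where the sponsor starts/ends, in seconds
segments = sponsors[["videoID", "startTime", "endTime"]]
print(segments.head())
```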

Their SQL database is open and updated often, so I used it to answer my next question:

Is it possible to detect sponsored content in YouTube videos?

I didn’t know where to start, but I knew I had two paths: working from the audio itself, or working from the transcripts.

I felt like audio would carry much more information than plain text (music, cadence...), so I started down that road.

Without knowing anything about the subject, I searched around for audio classification algorithms and found my way to the Panotti repo, which is based on a CNN (Convolutional Neural Network).

It was used to successfully detect 12 different guitar effects with 99.7% accuracy, making only 11 mistakes on 4,000 testing examples!

In my case I only needed two classes (sponsor and not sponsor).
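I won’t paste Panotti’s code here, but the general idea is a small CNN that looks at mel-spectrograms of short audio clips and outputs a class. Here is a rough Keras sketch of that shape, boiled down to my two-class case; it’s my own toy version, not Panotti’s actual architecture:

```python
import numpy as np
import librosa
from tensorflow.keras import layers, models

def clip_to_melspectrogram(path, sr=22050, n_mels=96):
    """Turn an audio clip into a fixed-size mel-spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr, duration=5.0)   # first 5 seconds of the clip
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# A tiny CNN: spectrogram in, "sponsor" vs "not sponsor" out
model = models.Sequential([
    layers.Input(shape=(96, 216, 1)),        # shape depends on clip length / hop size
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # two classes -> single sigmoid output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```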

I followed the instructions, learned how to use youtube-dl to download podcast highlights, and scraped Radiocentre’s commercials database, because at that point I hadn’t learned about SponsorBlock yet.
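Grabbing the audio itself is basically a one-liner with youtube-dl’s audio-extraction options. Roughly what I ran, wrapped in Python (it needs ffmpeg installed for the conversion step, and the URL below is just a placeholder example):

```python
import youtube_dl

options = {
    "format": "bestaudio/best",
    "postprocessors": [{
        "key": "FFmpegExtractAudio",   # convert to audio with ffmpeg
        "preferredcodec": "wav",
    }],
    "outtmpl": "clips/%(id)s.%(ext)s",
}

with youtube_dl.YoutubeDL(options) as ydl:
    # In practice this was a long list of podcast highlight URLs
    ydl.download(["https://www.youtube.com/watch?v=MlOPPuNv4Ec"])
```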

The results were mixed.

I figured that I simply needed to train the model on multiple different podcasts. The problem was that the data was getting super tough to handle: the dataset for 300 podcasts was about 50 gigs :(


So, in order to get more “information” out of less heavy data, I finally went with the transcripts!

The captions route

In the meantime I had learned about SponsorBlock, which made everything easier.

I also looked into NLP (natural language processing) tutorials and found this excellent repo about sentiment analysis.

Once I went through it, I decided to go with the Transformer model, which is based on BERT, an NLP framework developed by Google.
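In practice that means fine-tuning a pre-trained BERT with a small classification head on top. Here’s a minimal sketch using the Hugging Face transformers library, just to show the shape of it; my actual notebook (linked further down) differs in the details:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # sponsor vs. not sponsor
)

# One caption snippet -> probability of being sponsored content
text = "this video is brought to you by our sponsor ..."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
print(probs)  # [[p(not sponsor), p(sponsor)]] -- the head is untrained, so fine-tuning is needed
```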

To build the dataset, I still downloaded the videos with youtube-dl, except this time I fetched the automatic captions when they were available, giving me about 80k examples. (This took about 35 hours on a 100 Mb/s connection.)
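Fetching the auto-generated captions instead of the full video is also just a couple of youtube-dl options, which is what kept the scrape manageable. The subtitles come back as .vtt files that still need to be parsed into plain text; a rough sketch:

```python
import youtube_dl

options = {
    "skip_download": True,        # we only want the captions, not the video
    "writeautomaticsub": True,    # YouTube's auto-generated subtitles
    "subtitleslangs": ["en"],
    "subtitlesformat": "vtt",
    "outtmpl": "captions/%(id)s.%(ext)s",
}

# In practice, one URL per video ID coming out of the SponsorBlock dump
video_urls = ["https://www.youtube.com/watch?v=MlOPPuNv4Ec"]

with youtube_dl.YoutubeDL(options) as ydl:
    ydl.download(video_urls)
```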

To make things reliable, for training I only kept ads longer than 10 seconds and shorter than 5 minutes, while making sure to keep only videos that had a single ad.

To have a balanced dataset, I took equal parts sponsor and content from each video (if the ad was 3 minutes long, the content training example was also 3 minutes long).
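Concretely, for each video I cut the caption text at the sponsor’s start/end timestamps and paired it with an equally long slice of regular content right next to it. Something along these lines (simplified; the real preprocessing lives in the Kaggle notebook):

```python
def build_examples(captions, sponsor_start, sponsor_end):
    """captions: list of (start_seconds, text) cues parsed from a .vtt file.
    Returns one 'sponsor' example and one equally long 'content' example."""
    duration = sponsor_end - sponsor_start
    if not (10 <= duration <= 5 * 60):   # keep ads between 10 seconds and 5 minutes
        return []

    sponsor_text = " ".join(
        text for start, text in captions if sponsor_start <= start < sponsor_end
    )
    # An equally long slice of regular content, taken right after the ad
    content_text = " ".join(
        text for start, text in captions
        if sponsor_end <= start < sponsor_end + duration
    )
    return [(sponsor_text, 1), (content_text, 0)]   # 1 = sponsor, 0 = content
```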

You can find the dataset right here: https://www.kaggle.com/anaselmhamdi/sponsor-block-subtitles-80k

The model yielded 93.79% testing accuracy!

test accuracy

You can find the notebook I ran over here: https://www.kaggle.com/anaselmhamdi/transformers-sponsorblock/ (it’s really raw and undocumented though).



Working with the model

The next step was to try to automatically label a random video and see what would come out of it.

So here’s what popped up in my recommendations: https://www.youtube.com/watch?v=MlOPPuNv4Ec, a video by Linus Tech Tips.

I took my machine for a spin, and you can find the raw results here or on this spreadsheet.

It correctly detected the two sponsored segments (the Ting segments), but it also labelled some regular content as sponsored with a high confidence level.
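To produce those labels I simply slid a window over the video’s captions, ran each chunk through the classifier, and kept the windows that came back as “sponsor” with high confidence. Roughly like this (the window and step sizes here are arbitrary choices, not tuned values):

```python
def label_video(captions, classify, window_seconds=60, step_seconds=30):
    """captions: list of (start_seconds, text) cues.
    classify: a function returning P(sponsor) for a chunk of text.
    Returns (window_start, window_end, probability) tuples."""
    results = []
    if not captions:
        return results
    video_end = captions[-1][0]   # start time of the last cue, close enough to the end
    t = 0
    while t < video_end:
        chunk = " ".join(
            text for start, text in captions if t <= start < t + window_seconds
        )
        if chunk:
            results.append((t, t + window_seconds, classify(chunk)))
        t += step_seconds
    return results
```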

Looking at the mistakes, I feel like there’s a huge margin for improvement, both from fixing these specific issues and from more macro improvements to the approach.

It says a lot about how far the field has advanced that a complete newbie like me could build something that somewhat works.

All the data and projects I found were really well documented and easy to pick up.

The toughest part was actually deploying the model as a service, since I’m not used to running systems that need a lot of resources.
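The service itself doesn’t need to be much more than a thin HTTP wrapper around the model; the hard part is keeping a BERT-sized model alive and responsive on a small machine. Here’s a minimal Flask sketch of what such an endpoint could look like (predict_sponsor_probability is a hypothetical wrapper around the fine-tuned model):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_sponsor_probability(text: str) -> float:
    # Placeholder: in the real service this would run the fine-tuned BERT model
    return 0.5

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json().get("text", "")
    return jsonify({"sponsor_probability": predict_sponsor_probability(text)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```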

I actually deployed the API, but it’s not stable yet; I’ll update the post when it’s better!

See ya next post,

Anas

