September 11, 2023

An AWS hack for cheap and reliable proxies

I'm a Lambda maximalist

Hey gang,

Following up on my last article about proxy scraping, I thought I'd share a cool hack I found for anyone who runs scraping workflows through lambdas, which I often find myself doing.

To give you guys a concrete use case, I'll piggyback off the Scraping with Tor article, where I scrape LinkedIn job offers.

Tor is pretty tough to use on lambdas, and LinkedIn blocks a ton of free public proxies.

This gives us the perfect situation for our lambda-powered proxy hack.

To follow along, you’re gonna need Serverless installed and configured with your AWS account. I explain how to do so in this video!

Scraping LinkedIn job offers

Our goal here is to scrape all the new software engineering offers in France every day.

Here's the LinkedIn URL I was working with; the page looks fairly simple:

In a world where LinkedIn doesn't block our requests, the process would be fairly simple:
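
In code, that naive flow could look something like the sketch below: fetch the search page, pull out the job offer URLs, then request each offer one by one. This is a minimal sketch, not the code from the repo. The function names are mine, and the `/jobs/view/` link pattern is an assumption about how the job links look in the page.

```python
import re
import urllib.request

# Assumption: job offer links contain "/jobs/view/"; the pattern is illustrative only.
JOB_LINK_RE = re.compile(r'href="(https://[^"]*?/jobs/view/[^"?]+)')


def fetch(url):
    """Download a page and return its HTML as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_job_urls(search_html):
    """Return the unique job offer URLs found in a search results page, in page order."""
    return list(dict.fromkeys(JOB_LINK_RE.findall(search_html)))


def scrape_all(search_url, fetch_page=fetch):
    """Naive flow: one request for the search page, then one request per job offer."""
    job_urls = extract_job_urls(fetch_page(search_url))
    # Every request leaves from the same IP, which is exactly what gets us blocked.
    return {url: fetch_page(url) for url in job_urls}
```

`scrape_all` takes the fetch function as a parameter so you can try it against saved HTML without hitting the network.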

The issue with this process is that once we start looping over the job offers, LinkedIn will block our IP.

To leverage the hack, we need two cloud functions: a main function responsible for launching the whole process (including scraping the top-level job offer URLs), and a worker function that simply scrapes the job offer information from a single job URL.

So before implementing our hack, this is what our code looks like (the JOB_SCRAPER_ARN variable is the resource name of our worker function):
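
Here is a minimal sketch of what such a pair of functions could look like, assuming boto3's Lambda client and a worker that receives `{"url": ...}` as its event. The handler names, the event shape and the "any HTTP error means blocked" rule are assumptions of mine, not the exact code from the repo.

```python
import json
import os
import urllib.error
import urllib.request


def invoke_worker(lambda_client, function_name, job_url):
    """Synchronously invoke the worker Lambda for one job URL and return its decoded result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps({"url": job_url}).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())


def main_handler(event, context):
    """Main Lambda: take the job URLs and delegate each one to the worker."""
    import boto3  # imported here so invoke_worker can be tried without AWS access

    lambda_client = boto3.client("lambda")
    worker_arn = os.environ["JOB_SCRAPER_ARN"]
    # Placeholder: in the real flow these URLs come from scraping the search page.
    job_urls = event.get("job_urls", [])
    return [invoke_worker(lambda_client, worker_arn, url) for url in job_urls]


def worker_handler(event, context):
    """Worker Lambda: fetch one job offer page and report whether the request went through."""
    request = urllib.request.Request(event["url"], headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # Assumption: any HTTP error status is treated as "blocked".
        return {"url": event["url"], "blocked": True, "status": error.code}
    # Parsing the job details out of `html` is left out of this sketch.
    return {"url": event["url"], "blocked": False, "html": html}
```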

Our serverless configuration, which will not change throughout the tutorial, looks like this:
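
A configuration along these lines would do the job. Treat it as a sketch assuming Serverless Framework v3 syntax: the service name, handler paths, timeouts and daily schedule are hypothetical, and the IAM statement is deliberately broad for readability.

```yaml
# serverless.yml (sketch; names, handler paths and schedule are assumptions)
service: linkedin-job-scraper

provider:
  name: aws
  runtime: python3.9
  iam:
    role:
      statements:
        - Effect: Allow
          Action:
            - lambda:InvokeFunction
            - lambda:GetFunctionConfiguration
            - lambda:UpdateFunctionConfiguration
          Resource: "*"  # narrow this down to the worker's ARN in a real deployment

functions:
  main:
    handler: handler.main_handler
    timeout: 900  # the main function waits on every worker call
    environment:
      JOB_SCRAPER_ARN:
        Fn::GetAtt: [JobScraperLambdaFunction, Arn]
    events:
      - schedule: rate(1 day)
  jobScraper:
    handler: handler.worker_handler
    timeout: 30
    # no events: only invoked by the main function
```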

I didn't set up a trigger for our worker since it's only meant to be called by the lambda client in the main function.

You can already deploy the functions at this stage, but you'll run into errors really quickly.

Let's now talk about the meat of the article: the hack.

A cold start

To set the context for the hack, I have to go a bit further into what lambdas actually are.

AWS defines Lambda as:

A serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers.

Basically, write code and deploy a lambda => you can now use that code everywhere.

Under the hood, you can think of lambdas as a pool of idle VMs that get put to work when a function is called.

When your lambda is called for the first time, a process starts your ephemeral lambda instance, and then your code is run.

That first call takes significantly more time than the subsequent calls, which is why it's called a cold start.

According to AWS:

Cold starts typically occur in under 1% of invocations. The duration of a cold start varies from under 100 ms to over 1 second.

After you've made your calls and the function has been idle for a bit, the instance is destroyed, freeing the space for other uses.

Usually cold starts are seen as a drawback of using serverless functions, but in our case, we're going to use them as a feature.

Indeed, when a function cold starts, it is deployed on another pooled instance, which, as a consequence, typically changes the IP address its code runs from.

So if there's a scalable way to force a cold start, there's a way to change our outgoing IP at will.

Forcing cold starts

Forcing cold starts is actually super simple: any update to a lambda function will force it to cold start on its next run.

I chose to update the environment of the function that needs to run with a random ID.
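
With boto3, that update could look like the sketch below. The `COLD_START_ID` variable name is hypothetical. Note that `update_function_configuration` replaces the whole set of environment variables, so the sketch reads the existing ones first and merges the random ID into them.

```python
import uuid


def force_cold_start(lambda_client, function_name):
    """Change an environment variable so the next invocation starts on a fresh instance."""
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    # Keep the variables that are already set: the update call replaces all of them.
    variables = dict(config.get("Environment", {}).get("Variables", {}))
    variables["COLD_START_ID"] = uuid.uuid4().hex  # hypothetical variable name
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables["COLD_START_ID"]
```

You would call it with a real client, for example `force_cold_start(boto3.client("lambda"), os.environ["JOB_SCRAPER_ARN"])`.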

To use it in a synchronous way as we'll do for our scraping, we also need to wait for our function to finish updating using an AWS waiter:
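
A minimal sketch of that wait, assuming boto3's built-in `function_updated` waiter for Lambda. The helper name and the polling values are mine.

```python
def wait_for_update(lambda_client, function_name, delay=1, max_attempts=60):
    """Block until the function's configuration update has finished being applied."""
    waiter = lambda_client.get_waiter("function_updated")
    waiter.wait(
        FunctionName=function_name,
        # Poll every `delay` seconds, giving up after `max_attempts` polls.
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
```

Calling it right after the configuration update means the next invocation only happens once the update has gone through.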

Finally, I'll simply add an if/else statement to force a cold start if LinkedIn blocked our call, and voilà:
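
One way to wire that check into the main loop is sketched below. It assumes the worker returns a dict with a `blocked` flag, and it takes the "scrape one URL" and "force a cold start" steps as plain callables. Those names and the retry limit are assumptions of mine.

```python
def scrape_with_rotation(job_urls, scrape, rotate_ip, max_retries=3):
    """Scrape each URL, forcing a cold start and retrying whenever a call comes back blocked."""
    results = []
    for url in job_urls:
        result = scrape(url)
        retries = 0
        while result.get("blocked") and retries < max_retries:
            rotate_ip()           # e.g. update the worker's environment and wait for it
            result = scrape(url)  # this invocation cold starts on a fresh instance
            retries += 1
        results.append(result)
    return results
```

The retry cap keeps the loop from spinning forever if a fresh instance happens to land on an IP that is already blocked.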

I also added the IP to the output for good measure, to show you how often we actually change IPs.
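
If you want to do the same, one option is to ask a "what's my IP" endpoint from inside the worker. A minimal sketch, assuming `https://checkip.amazonaws.com` (which returns the caller's IP as plain text). The function names are mine and any similar endpoint would work.

```python
import urllib.request

# Assumption: any endpoint that echoes the caller's public IP as plain text will do.
IP_ECHO_URL = "https://checkip.amazonaws.com"


def get_public_ip(url=IP_ECHO_URL, timeout=5):
    """Return the public IP address the current instance is sending requests from."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def with_ip(result, ip_lookup=get_public_ip):
    """Return a copy of the worker's result with the instance's public IP attached."""
    return {**result, "ip": ip_lookup()}
```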

Forcing a cold start is free and, as you already know, lambda computing is dirt cheap, so it's perfect for small use cases like this.

You can clone my public repo for yourself on GitHub!
