September 21, 2023

Go vs Python for web scraping: what's better?

As an absolute Go noob, I keep FOMOing over this fairly new language compared to the ones I usually use, JS and Python.

If the rumors are true, Go is becoming the go-to language (no pun intended) to ship stuff fast, efficiently and safely.

I thought I'd give it a run and give you my feedback on my fave use case: web scraping.

I'll compare execution times and memory profiles to see if Go really is worth it for me.

Scraping a list of the NBA All Stars

Nothing too fancy today: I'm going to be scraping the NBA's best players ever off of this Wikipedia page.

We'll get the list here and then scrape the players individually, scraping their basic data:

Now that we have our scraping plan, I'm going to be scraping the 250+ profiles two ways in each of the two languages: without concurrency and with concurrency.

I'm going to be using a proxy for every run to make sure our requests go through.

With Python

No concurrency

The single-threaded code is fairly straightforward and concise, which is what I love the most about Python.

When you start running this code, it'll start outputting the names and info about each scraped player. It runs in about 80 seconds and it was very easy to write.

With concurrency

A common way to run concurrent requests in Python is to use asyncio with the aiohttp library. It's more verbose and less intuitive than the previous code, unfortunately. I don't know why, but it never feels right when I write it:

It looks like it went about twice as fast. Not as good as I expected but hey, still a good timesaver.

I used the tracemalloc library to figure out the peak memory consumed, which was about 66 MB:

Let's now switch to Go and see if we make any significant improvements.

Getting started with Go

Coming from Python, it was surprisingly easy. Here are the resources I used:

Let's Go scraping

No concurrency

Our goal today is to scrape all of the NBA All Stars ever, along with their individual height and weight.

The closest equivalent of the Python requests library here is Colly, a scraping framework for Go.

It uses callbacks à la JS to operate on the request object and parse the HTML for specific information.

To get started, I simply needed to run:

go mod init nba_all_stars_scraper
go get -u github.com/gocolly/colly/...

The objects that actually do the parsing are called collectors. They're super useful: they hold the scraping logic and can be cloned to replicate a specific behavior (running behind proxies here, for instance).

Conceptually it was very easy to get into.

To get started with a single thread, my code looked like this:

After a quick go run . :

The scraper ran way faster than the Python one, about 5 times as fast.

Let's push it further and see what improvements we can make with concurrency.

With concurrency

No changes of libraries here, I just needed to be introduced to goroutines and wait groups.

Goroutines are the Go way of running functions concurrently (the rough equivalent of coroutines in Python) and wait groups allow us to sync goroutines.

They're explained better here on gobyexample.

In our case, we just needed to increment the wait group for each goroutine created for profile scraping and wait for all of them to be done. (The defer keyword schedules code to run right before the surrounding function returns.)
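To make the pattern concrete, here is a minimal sketch of the wait group idea using only the standard library. It is not the scraper itself: scrapeProfile and scrapeAll are hypothetical names, the URLs are made up, and the sleep stands in for the real request and parsing.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// scrapeProfile is a hypothetical stand-in for fetching and parsing one
// player page; here it just sleeps to simulate network latency.
func scrapeProfile(url string) string {
	time.Sleep(10 * time.Millisecond)
	return "scraped " + url
}

// scrapeAll starts one goroutine per URL and blocks until all are done.
func scrapeAll(urls []string) []string {
	results := make([]string, len(urls))
	var wg sync.WaitGroup

	for i, url := range urls {
		wg.Add(1) // one more goroutine to wait for
		go func(i int, url string) {
			defer wg.Done() // runs right before this goroutine's function returns
			results[i] = scrapeProfile(url) // each goroutine writes only its own slot
		}(i, url)
	}

	wg.Wait() // block until every goroutine has called Done
	return results
}

func main() {
	// Made-up paths, for illustration only.
	urls := []string{"/wiki/Player_A", "/wiki/Player_B", "/wiki/Player_C"}
	for _, r := range scrapeAll(urls) {
		fmt.Println(r)
	}
}
```

Each goroutine writes to its own index of the results slice, so no mutex is needed in this sketch. A real scraper would probably also want to cap how many requests run at once.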

So again here, the implementation was fairly straightforward:

This code ran in 7.7s:

About 2 times faster than the non-concurrent version, and about 5 times faster than the concurrent Python version as well!

What about memory?

I used pprof (see their GitHub) to profile the memory, as explained in this article.

After adding the few lines of code necessary to profile the memory used by my program, I ran it to generate the profile and ran pprof:

This opened a browser window with this SVG showing how the memory was used and by what.

Pretty cool: we can see that pprof itself actually took a good chunk of it (which tracemalloc also does in Python).

All in all, running the program took 5.5 MB worth of memory, more than 10 times less than Python.

Conclusion

I think I have some legacy Python refactoring to do ...

You can access the code for each language in this repo!

BONUS

If you made it this far, I also tried Rust and the results weren't even close.

Rust is fun to learn but it's way more complicated than Go or Python to get 'small' stuff done.

Everything needs to be explicitly typed, and you need to know what you're doing with memory.

Mechanically, it can get pretty verbose, and for non-CPU-bound ops like the ones we do here, it doesn't get a chance to shine.

This piece of code ran in 130s, roughly 60% slower than Python's 80 seconds or so when using no concurrency:

I got lazy with the parsing, I know.

However, when using concurrency, which was a pain to set up, it performed a bit better than Python at 39s, about 10% faster (for 10 times the pain):
