The library that I am discussing is meant for testing websites, but it could also be used for scraping. It is your responsibility, not mine, to scrape websites responsibly. Scraping static websites is easy. To protect themselves against scrapers (and to reduce the load on their servers) many websites use javascript to load data asynchronously after a user requests a page. In such a situation the client needs to wait until all the javascript has executed before all the html is generated. In these cases you cannot use libraries like urllib or requests to retrieve the rendered html.
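
To see why, note that a plain http client only ever receives the initial html, before any javascript has run. A minimal sketch with requests (example.com is just a stand-in url):

import requests

# the raw html exactly as served; no javascript has executed,
# so any content the page loads asynchronously is missing here
html = requests.get('http://www.example.com').text
print(html[:200])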

Enter selenium

Fortunately, there is a nice python library called selenium that drives a real browser for you, which still allows you to automate the collection of online data. It originated as a java project, but you can install the python bindings via pip. Selenium will use firefox as its default browser, so make sure firefox is installed before installing selenium.

$ sudo pip install selenium

Let's do a hello world example. We will have selenium open google.com and return the browser window's title. Open up a python terminal and run the following script:

from selenium import webdriver

# start a firefox browser, load google and print the window title
browser = webdriver.Firefox()
browser.get('http://www.google.com')
print(browser.title)
browser.quit()

You should see a firefox window open and close. Because we drive an actual browser window, we get a full javascript interpreter along with it. If the page has javascript that needs to run, you can have python wait for it to finish:

from selenium import webdriver
import time

browser = webdriver.Firefox()
browser.get('http://www.google.com')
time.sleep(1)  # crude: give the javascript one second to finish
print(browser.title)
browser.quit()
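
Sleeping for a fixed number of seconds is crude: it wastes time when the page is fast and fails when the page is slow. Selenium also ships a WebDriverWait helper that polls until a condition holds. Here is a minimal sketch, assuming the content you care about eventually appears in an element matched by the (hypothetical) css selector '#results':

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get('http://www.google.com')

# poll for at most 10 seconds until the element exists in the DOM;
# '#results' is a placeholder selector, adapt it to the page you scrape
element = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#results'))
)
print(browser.title)
browser.quit()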

Full javascript control

Selenium gives you a lot of control over the browser. We can have it wait until any javascript that needs to run has finished. We can even run any javascript we want in the browser from python:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.google.com')

# execute_script hands the javascript return value back to python
cookies = browser.execute_script("return document.cookie")
user_agent = browser.execute_script("return navigator.userAgent")
print(cookies)
print(user_agent)
browser.quit()
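
Return values are not limited to strings either; javascript arrays and objects come back as python lists and dicts. A small sketch that collects every link on the page (the javascript here is just one way to write it):

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.google.com')

# a javascript array of strings arrives in python as a list of strings
links = browser.execute_script(
    "return [].map.call(document.querySelectorAll('a'), "
    "function(a) { return a.href; });"
)
print(links)
browser.quit()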

Anything that javascript can access can be returned to python. You can even trigger click events or query elements through css selectors:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.google.com')

# find the search box, type a query and click the search button;
# 'input' would shadow the python builtin, so pick another name
search_box = browser.find_element_by_css_selector('input[type="text"]')
search_box.send_keys('koaning.com')
button = browser.find_element_by_css_selector('button')
button.click()
browser.quit()

Instead of clicking the button, you can achieve the same thing by sending keyboard input:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()
browser.get('http://www.google.com')

# type the query into the search box and press enter
search_box = browser.find_element_by_css_selector('input[type="text"]')
search_box.send_keys('koaning.com')
search_box.send_keys(Keys.ENTER)
browser.quit()

Notice that you could also call search_box.submit() to submit the form that contains the input element, instead of sending a Keys.ENTER keystroke.
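
In that variant only the last interaction changes; a minimal sketch:

# submit() submits the form the element belongs to,
# which here has the same effect as pressing enter
search_box.send_keys('koaning.com')
search_box.submit()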

Automation

If you want to automate this approach you will most likely want to outsource the scraping to a server (the javascript can take some time). Initially you might notice that these scripts do not always work when you run them through ssh on another machine. This is because selenium needs a window to operate in; it cannot run the browser from a bare console. To get selenium to work we need to fake a display, which can be done with pyvirtualdisplay.

You can install it via:

$ sudo pip install pyvirtualdisplay

If you log into a server through ssh, the following python script will still work:

from pyvirtualdisplay import Display
from selenium import webdriver

# start a virtual (invisible) display for firefox to render into
display = Display(visible=0, size=(800, 600))
display.start()

browser = webdriver.Firefox()
browser.get('http://www.google.com')
print(browser.title)
browser.quit()

display.stop()

This can be useful for small scrape jobs, but be nice to the internet: both the client and the server do extra work with this trick. If you're gonna scrape, scrape responsibly.
