What’s the average rental price of a 2-bed in Dubai these days? (PART 2)

Silviu Matei
6 min read · Sep 27, 2020

Scrape the detailed information of each property listed on Property Finder

Photo by Nextvoyage from Pexels

Quick recap: after generating the links to the search pages for properties listed for rent in the UAE, we are now going to scrape more detailed information about each property so that, in the end, we can come up with the average price for different kinds of properties.

The goal is for us to collect data about these properties and save them into a Pandas dataframe for analysis. A dataframe is a 2-dimensional table in Pandas, Python’s most popular module for data storing and analysis.

Here is what the table of properties will look like: one row per listing, with columns for the title, property type, bedrooms, bathrooms, area, price, location, and a link to the listing.

In essence, we will loop through the list of search pages, extract the information about each property, and put it in the dataframe.
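To make that target structure concrete, here is a minimal sketch of the empty dataframe we are going to fill, using the same column names that appear later in the script:

import pandas as pd

# The columns collected for each listing (same names used later in the script)
columns = ['title', 'type', 'bedrooms', 'bathrooms', 'area', 'price', 'location', 'link']
results = pd.DataFrame(columns=columns)
print(results)  # empty for now; one row per scraped listing will be added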

To start with, the relevant packages are imported, and I added some code that you can uncomment and edit in order to set the path to your local folder.

# Packages we need in order to perform the scraping
import grequests
from grequests import get
from bs4 import BeautifulSoup
import lxml
import pandas as pd
import numpy as np
from time import sleep
from random import randint
import os
from datetime import datetime

# EDIT THIS TO YOUR LOCAL FOLDER
#os.chdir('C:\\path\\to\\the\\projects\\folder')

today = datetime.today().strftime('%m-%d-%Y')

One function is enough to run the whole scraping process; I called it rentals. At a high level, we pass the list of search-page links (alist) to this function and it returns a dataframe called results. The Jupyter notebook is saved on GitHub, and a condensed sketch of the assembled function appears after the step-by-step walkthrough below.

In more detail, these are the operations defined in the rentals function:

1. Declare the results dataframe: results = pd.DataFrame()

2. Send requests to the Property Finder website, 10 at a time:

rs = (grequests.get(url) for url in alist)
responses = grequests.imap(rs, size = 10)

3. Create a for loop to go through each batch of responses, parse the content with BeautifulSoup, and find all the tags identifying the listed properties:

for response in responses:
    soup = BeautifulSoup(response.text, 'lxml')
    div_tag = soup.find_all('div', {'class':'card-list__item'})

In order to identify the property tag, we open the Property Finder page, right-click, choose Inspect, click the element-picker button at the top left of the panel that opens, and then select a property card. We copy the class of the div tag, namely ‘card-list__item’, and paste it into the script. You can check: there are 25 such card-list items per search page.
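A quick way to confirm that count in the notebook, assuming soup holds a parsed search page as above:

print(len(div_tag))  # expected: 25 listing cards per search page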

4. Next, we iterate through each of these cards to extract the data from all of them. A very important aspect, which took me a day to figure out, is how to deal with missing data. My solution was to use a try-except block for each extracted variable and to assign the value None when the tag does not exist or is empty.

For instance, the script looks for the h2 tag with the class card__title card__title-link. If the tag exists and is not empty, the script extracts its text attribute and deletes any extra spaces with the strip() function. If the tag does not exist, it assigns the value None to the title variable. The process is similar for all the other data points.

for div in div_tag:

    try:
        title = div.find('h2', {'class':'card__title card__title-link'}).text.strip()
    except:
        title = None

    try:
        ttype = div.find('p', {'class':'card__property-amenity card__property-amenity--property-type'}).text.strip()
    except:
        ttype = None

    try:
        bedrooms = div.find('p', {'class':'card__property-amenity card__property-amenity--bedrooms'}).text.strip()
    except:
        bedrooms = None

    try:
        bathrooms = div.find('p', {'class':'card__property-amenity--bathrooms'}).text.strip()
    except:
        bathrooms = None

    try:
        area = div.find('p', {'class':'card__property-amenity card__property-amenity--area'}).text.strip()
    except:
        area = None

    try:
        price = div.find('span', {'class':'card__price-value'}).text.strip()
    except:
        price = None

    try:
        frequency = div.find('p', {'class':'card__property-amenity--bathrooms'}).text.strip()
    except:
        frequency = None

    try:
        location = div.find('span', {'class':'card__location-text'}).text.strip()
    except:
        location = None

    try:
        link = div.find('a', {'class':'card card--clickable'})['href']
        link = 'www.propertyfinder.ae' + link
    except:
        link = None

A special mention for the last variable, which is the direct link to the listed property page. Note that we extract the ‘href’ attribute rather than the text and, because we may want to use these links in order to extract some more data present on the individual-property pages, we concatenate the relative link of each property with the base website address, www.propertyfinder.ae.

5. The next step is to declare a temporary dataframe, temp_df, to store the output from each iteration, specifying the variables to store and the names of the columns. Finally, we append the output of each iteration to the dataframe called results.

temp_df = pd.DataFrame([[title, ttype, bedrooms, bathrooms, area, price, location, link]],
                       columns=['title', 'type', 'bedrooms', 'bathrooms', 'area', 'price', 'location', 'link'])
results = results.append(temp_df, sort=False).reset_index(drop=True)

6. Scraping can be tricky. Different websites have different limits on the number of requests they accept, and it is your job to find the best way to deal with that. For the purpose of this exercise, I add a lag of 2 to 3 seconds between iterations, by drawing a random value of 2 or 3 seconds and telling the system to sleep for that long. The important thing to remember is that if your requests stop going through, it may be because the website blocked you for making too many requests in a very short period of time.

In this script, you can try reducing the number of pages requested at a time, for instance by changing 10 to 5 in the line above: responses = grequests.imap(rs, size = 5). In addition, you may want to increase the time between requests, i.e. change the line below to something like sleep(randint(4,12)) or larger numbers. However, this will increase the total scraping time.

sleep(randint(2,3))
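Putting the snippets from steps 1 to 6 together, here is a condensed sketch of how the pieces fit inside the rentals function. It is a sketch rather than the exact notebook code: the field-by-field try-except blocks are collapsed into a small helper of my own (get_text), the frequency field (which reuses the bathrooms class and is never written to the dataframe) is left out, and pd.concat replaces DataFrame.append, which newer versions of pandas no longer support. The full version is in the notebook on GitHub.

import grequests  # grequests patches the standard library via gevent, so import it first
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
from random import randint


def get_text(div, tag, cls):
    """Return the stripped text of the first matching tag, or None if it is missing."""
    try:
        return div.find(tag, {'class': cls}).text.strip()
    except AttributeError:
        return None


def rentals(alist):
    results = pd.DataFrame()

    # Step 2: send the requests, 10 at a time
    rs = (grequests.get(url) for url in alist)
    responses = grequests.imap(rs, size=10)

    for response in responses:
        # Step 3: parse the page and find the listing cards (25 per search page)
        soup = BeautifulSoup(response.text, 'lxml')
        div_tag = soup.find_all('div', {'class': 'card-list__item'})

        # Step 4: extract each field, falling back to None when a tag is missing
        for div in div_tag:
            title = get_text(div, 'h2', 'card__title card__title-link')
            ttype = get_text(div, 'p', 'card__property-amenity card__property-amenity--property-type')
            bedrooms = get_text(div, 'p', 'card__property-amenity card__property-amenity--bedrooms')
            bathrooms = get_text(div, 'p', 'card__property-amenity--bathrooms')
            area = get_text(div, 'p', 'card__property-amenity card__property-amenity--area')
            price = get_text(div, 'span', 'card__price-value')
            location = get_text(div, 'span', 'card__location-text')
            try:
                link = 'www.propertyfinder.ae' + div.find('a', {'class': 'card card--clickable'})['href']
            except (TypeError, KeyError):
                link = None

            # Step 5: one-row dataframe per listing, appended to the results
            temp_df = pd.DataFrame([[title, ttype, bedrooms, bathrooms, area, price, location, link]],
                                   columns=['title', 'type', 'bedrooms', 'bathrooms',
                                            'area', 'price', 'location', 'link'])
            results = pd.concat([results, temp_df]).reset_index(drop=True)

        # Step 6: pause 2-3 seconds between iterations to avoid being blocked
        sleep(randint(2, 3))

    return results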

We can now prepare for the actual processing. First, we load the list of search-page links that we generated in the previous exercise. Note that the name of the .csv file is built so that the script can run daily without manually changing the file name. To load the data, we use the read_csv function from the Pandas library.

#LOAD THE DAILY LIST OF LINKS
pf = pd.read_csv('AllPFPages-' + today + '.csv', sep=',')

#SHOW THE DATAFRAME FOR A QUICK INSPECTION
pf

Next, we convert the ‘Links’ column of the pf dataframe into a Python list with the .tolist() function so that we can pass it to the rentals function defined above. I like to check things as I go, so you’ll notice that we also print the number of search pages on the list; it shows 1,716 when run on the Sep 21st data.

#LOAD THE LINKS INTO LIST
links = pf['Links'].tolist()

#PRINT THE TOTAL NUMBER OF PAGES TO SCRAPE
print(len(links))

We have now arrived at the most important step: running the scraper. There are a few different ways to time the execution; I preferred to create start and end variables and print them to keep a log.

The scraper itself is run by passing the list of links to the rentals function and saving the output into the dataframe named data.

data = rentals(links)
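The start and end variables mentioned above are not shown in the snippet; here is a minimal sketch of one way to log them (an assumed approach, not necessarily the exact notebook code):

from datetime import datetime  # already imported at the top of the script

start = datetime.now()
print('Scraping started:', start)

data = rentals(links)  # the same call as above, now timed

end = datetime.now()
print('Scraping finished:', end)
print('Duration:', end - start)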

Finally, we save the dataframe to a local CSV file for further analysis. Notice the dynamic naming of the files to enable automatic running every day, without the need to manually change the file names.

data.to_csv('PF_Rentals-' + today + '.csv', sep=',')
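Before moving on, a quick sanity check on the scraped dataframe is worthwhile; for example (illustrative, not from the original notebook):

print(data.shape)   # (rows, columns): one row per scraped listing
print(data.head())  # first few rows, to eyeball titles, prices and locations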

The output is saved on GitHub as a CSV file (11 MB). I proudly present to you the data about 42,768 properties. The scraping took 38 minutes on my laptop and 12 minutes on Google Colab. I’ll write a short post on how to set up Google Colab and what the tricks are to make it run smoothly.

Stay tuned as we fast approach the moment to reveal the average rental price for a 2-bedroom in Dubai.
