What’s the average rental price of a 2-bed in Dubai these days?

Silviu Matei
11 min read · Sep 21, 2020

Find it like a data scientist (PART 1)

Are you a tenant who wants to know how your rent compares to the market average? Or maybe you are looking to change your abode?

Here’s my solution for anyone with some data science and programming skills. If you want to get the average price, you first need all the prices in the market or, at a minimum, the prices listed on one of the major real estate portals. And since I live in Dubai, that is where we will start: I’ll walk you through how to scrape the prices of properties listed for rent on Property Finder.

I will do this in 3 separate steps:

  1. Generate the list of pages on Property Finder listing properties for rent
  2. Scrape the detailed information of each property listed on all these pages
  3. Run a few statistics to find out the average price for each type of property

Let us start with the first step, namely, getting the links of all the pages listing the properties to rent across the Emirates.

For this, go to the landing page of Property Finder and, without making any choice, click on FIND.

By default, it shows the properties listed for rent in Dubai: note the Dubai option above the search tab, which gets selected automatically once Find is clicked.

Right below the search tab, we get the total number of items found: 30,927. Keep in mind that this number may change every few seconds so do not be surprised if the same search yields slightly different outputs. Scrolling down the page, we notice that there are 25 properties listed on the search page. This means, therefore, that — for Dubai — there will be around 30,927 / 25 = 1,237 pages, from which we will extract the information about each property being listed.
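As a quick sanity check in Python (using the count displayed at the time of writing; the figure on the site changes constantly):

total_listings = 30927   # count shown on the search page at the time of writing
per_page = 25            # properties shown on each search page
print(total_listings / per_page)   # about 1237.1, i.e. roughly 1,237 pages plus one partially filled page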

Let’s build an intuition of what web scraping means. If we’ve got 1,000+ pages with information that we need to analyse, it is not practical to visit all of them and copy-paste selected bits into an Excel file for analysis. First, it takes too long, and by the time we finish, the website may have changed significantly. Second, copy-pasting will also bring along pictures and various other formatting elements such as tables, which will make it really difficult to work with the data.

The question is how can we do this in an automatic and fast way so that, for instance, we can check every day if the average price goes up or down. The process of extracting the information from websites and saving or processing it for further analysis is called web scraping.

There are several modules in Python to perform web scraping. For a detailed review, see here.

I will use the following setup for this first exercise:

  • Jupyter Notebooks with Python 3.6 will be used for development, see this post for how to set it up.
  • Make sure that you also have these packages installed, or install them in your environment with pip install: grequests, bs4, lxml, pandas, numpy

Basically, we use grequests to make asynchronous requests to the website, BeautifulSoup with the lxml parser to parse the pages, and pandas and numpy for data manipulation.

This is the time to ask ourselves: what do we actually need, and in which format?

Assuming the first two properties are the ones in the image above, let’s say that we want the output to look like this:

In order to generate all the pages to visit for data extraction, let’s have a look at the address of the first page for the search in Dubai:

https://www.propertyfinder.ae/en/search?c=2&l=1&ob=mr&page=1&rp=y&t=1

It is easy to see that this is a templated address in which c=2 is a filter to show only the properties for rent; if we replace 2 with 1, it will show only the properties for sale. 'l' is a filter for the city, with l=1 being the value for Dubai; if we replace 1 with 2, only properties in Umm Al Quwain will be listed. 'page' simply stands for the search page number.
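To make the pattern concrete, here is a small illustration of how those parameters combine (a sketch; the parameter meanings are as described above):

base = 'https://www.propertyfinder.ae/en/search?c={c}&l={l}&ob=mr&page={p}&rp=y&t=1'
print(base.format(c=2, l=1, p=1))   # properties for rent (c=2) in Dubai (l=1), first search page
print(base.format(c=2, l=2, p=3))   # properties for rent in Umm Al Quwain (l=2), third search page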

To scrape the information from all the search pages, for all the cities, we first need to find out how many search pages there are for each city.

We’ve already seen how we can get that: by dividing the number of properties by the number of properties shown on each page (25). We’ve seen that — for Dubai — there are around 1,237 search pages listing the 30,927 properties. As we also know that the code for Dubai is 1, we can write code to generate all the links for Dubai, keeping l=1 and increasing the value of the page by one every time up to 1,237. We can then do the same for all the other cities in the Emirates. Note that, for the cities (the 'l' variable), we will only use values from 1 to 8, to represent Dubai (1), Umm Al Quwain (2), Ras Al Khaimah (3), Sharjah (4), Ajman (5), Abu Dhabi (6), Fujairah (7), and Al Ain (8).

So let’s first scrape the number of properties in each city. The full code is uploaded on GitHub; I will just provide explanations below.

First of all, we need to import the relevant packages, set the working directory, and create a variable for the date when the script is run.
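The exact setup code is in the GitHub script; a minimal sketch of what this step does (the working directory path is a placeholder you should replace with your own) would look like this:

import os
import datetime

import grequests                  # asynchronous HTTP requests
from bs4 import BeautifulSoup     # HTML parsing (lxml installed as the parser)
import pandas as pd
import numpy as np

os.chdir('/path/to/your/working/folder')              # placeholder: set your own working directory
today = datetime.date.today().strftime('%Y-%m-%d')    # date stamp used later in the output file name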

Next, we define some functions in Python to do the work for us. Functions may feel like an advanced feature if you are new to Python, but they let us reuse the same logic for every city and keep the script organised. Don’t be afraid, I’ll explain it step by step. At this point, just keep in mind that we first define the functions without processing any data, as if we were creating the tools that Python will use later.

To start with, we need a function to generate a list of all the first search pages for all 8 cities. I called this function ‘firstpage’.

################## FUNCTION 1: TO SCRAPE THE FIRST SEARCH PAGE FOR EACH CITY
def firstpage():
    u = 'https://www.propertyfinder.ae/en/search?c=2&l={}&ob=mr&page=1&rp=y&t=1'
    city = [1, 2, 3, 4, 5, 6, 7, 8]
    url = []

    for i in city:
        temp_url = u.format(i)
        url.append(temp_url)
    print(url)
    return url

The function doesn't take any arguments, which is why the parentheses are empty. We initialise a string variable u, a list city with values from 1 to 8 (one per city), and we declare url as the output list. The u string is the base URL. Note that we replaced the city number with braces {} because we want to generate the links dynamically, with those braces taking the values from 1 to 8, one for each city.

Next, we create a for loop for each value i in the list ‘city’.

We ask Python to replace the braces in the u string with each value in city and we save each iteration in a temporary variable.

And then we append the output of each iteration to the url list. Finally, we print all the links just for a quick inspection and we output the url list.
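If you call the function at this point (a quick check, assuming it has been defined as above), you should get eight links, one first search page per city code:

pages = firstpage()
print(len(pages))   # 8
print(pages[0])     # https://www.propertyfinder.ae/en/search?c=2&l=1&ob=mr&page=1&rp=y&t=1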

The second function extracts the number of properties to rent in each city and saves the series in a Python list.

First, let’s examine the element that holds the information we want to extract from the page (see image below). To do that, open the page in Chrome/Brave, right-click, choose Inspect, click the element-picker button (marked in red), then left-click on the number 30,975. Without going into the structure of an HTML file, we notice that the number is contained in a div block of a particular class, 'property-header__list-count property-header__list-count--new ge_resultsnumber text--size2 text--color1 text--normal'. That is how we will programmatically find the item on the page.

The function getnoofpages takes alist as an argument; alist will be the output of the previous function. In other words, we feed the list of URLs from the first function into the second one in order to visit all 8 pages and extract the number of properties for rent in each city. Inside the function we also declare a list p, which will hold the total number of properties listed in each city.

############# FUNCTION TO EXTRACT THE NUMBER OF PAGES AND SAVE IT IN A PYTHON LIST
def getnoofpages(alist):
    import re

    p = []

    req = (grequests.get(url) for url in alist)
    response = grequests.imap(req, size=1)

    for r in response:
        try:
            soup = BeautifulSoup(r.text, 'lxml')
            temp = soup.find('div', class_='property-header__list-count property-header__list-count--new ge_resultsnumber text--size2 text--color1 text--normal').text
            temp = re.findall(r'\d+', temp)   # keep only the digits
            temp = ''.join(temp)
            p.append(temp)
        except AttributeError:   # the div was not found on this page, so skip it
            pass
    return p

In detail:

We import the re module locally i.e. inside the function because we only use it there to extract the number from a string.

We make requests to the website for each page (url from the list).

Then, we store the output in response. Because we have a small list, we send just one request per attempt (size=1). (See grequests documentation for details).

Next, we create a for loop to run the extraction algorithm on each page whose content is saved in response.

To deal with exceptions without stopping the loop, I went with a try-except structure. If Python finds the div class, it extracts its text; otherwise, it moves on.

We pass the page content on to BeautifulSoup for parsing, and we use the lxml parser rather than the built-in HTML parser because it is faster.

Then, we search for the div with this specific class and save its text attribute into a temporary variable.

Next, we extract only the digits (i.e. dropping the ‘results’ text) and join them into a single string. Finally, we append the result of each loop to the list p and we return p.
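As a standalone illustration of that extraction step, assuming the div text looks roughly like '30,927 results' (the exact wording on the live site may differ):

import re

temp = '30,927 results'
digits = re.findall(r'\d+', temp)   # ['30', '927']
print(''.join(digits))              # '30927' (still a string at this point)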

We use the third and last function to generate all the search pages, for all the cities, and put them into a list that we will use in the next story to scrape the characteristics of each property.

############ FUNCTION TO GENERATE THE LINKS TO ALL THE SEARCH PAGES
def genopage(a2list):
    city = [1, 2, 3, 4, 5, 6, 7, 8]
    u = 'https://www.propertyfinder.ae/en/search?c=2&l={}&ob=mr&page={}&rp=y&t=1'
    url = []

    for c, v in zip(city, range(len(a2list))):
        n = a2list[v]
        print(n)
        for g in range(1, n):
            temp_url = u.format(c, g)
            url.append(temp_url)
    return url

Here is the explanation for each element:

The function genopage takes the a2list as an argument — basically, we will feed the output list from the 2nd function into this one.

We declare two lists: city, because the links include the numbers for each city, and url, which is the output of the function, containing all the search pages listing the properties for rent. We also declare the string variable u to store the base URL, in which we will substitute the city and the page values.

Then, we start a for loop over each pair of city code (c) and position (v): n takes the value from the list we feed in, at the position corresponding to that city. This may be a little hard to follow, so here is what it means: we know that the value 1 in the city list is assigned to Dubai, and that the first number in a2list will be the number of pages for Dubai, but Python doesn’t know that, so we have to spell out the pairing, as in the sketch below.
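Here is a minimal sketch of that pairing, using made-up page counts purely for illustration:

demo_counts = [1243, 1, 34]   # hypothetical page counts for the first three cities
for c, v in zip([1, 2, 3], range(len(demo_counts))):
    print(c, demo_counts[v])
# 1 1243  -> city code 1 (Dubai) paired with the first count
# 2 1     -> city code 2 (Umm Al Quwain) paired with the second count
# 3 34    -> city code 3 (Ras Al Khaimah) paired with the third count

The same pairing could be written more directly as for c, n in zip(city, a2list), but the loop above mirrors the original code.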

We then print the number of pages just to check them, and we create a new for loop to generate the URLs. Note that in temp_url = u.format(c, g), c stands for the city number while g stands for each page number. We finish by appending all the URLs to the url list and returning the list.

There is one more step: dividing the output from the function getnoofpages by 25 and rounding the result up to an integer, because the page number in the URL must be a whole number. But we will do this without a function, when we actually process the data.

We are almost done; now we start the actual operations:

############ COLLECTING THE DATA
### Call the first two functions: firstpage() builds the initial search page for each city and getnoofpages() extracts the property counts
e = getnoofpages(firstpage())
print(e)

Output: ['31071', '7', '836', '1686', '631', '8103', '8', '440'] (you may get slightly different figures, depending on when you run the code).

What we notice is that this is a list of strings so we need to do a few extra operations:

e = list(map(int, e))       # convert the strings to integers
e = [e1 / 25 for e1 in e]   # divide by 25
e = [int(e1) for e1 in e]   # convert the results to natural numbers
e = [e1 + 2 for e1 in e]    # add 2 to the final result
f = genopage(e)             # generate the final list of links

We need to convert the strings into numbers and then divide them by 25 (the number of properties listed on each search page). Because the division may produce a fractional result, we truncate it with int(); whenever there is a remainder, there is one extra, partially filled page, so we should add one. We add one more because range(1, n) stops at n - 1 (it starts at 1 rather than the default 0), so the last page would otherwise be dropped. Hence we add 2 in total.
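An equivalent way to write that arithmetic, which some may find easier to read, uses math.ceil (just a sketch, not the code from the GitHub script):

import math

counts = [31071, 7, 836, 1686, 631, 8103, 8, 440]   # property counts per city, as scraped above
pages = [math.ceil(n / 25) for n in counts]          # search pages per city
print(pages)                                         # [1243, 1, 34, 68, 26, 325, 1, 18]
f = genopage([p + 1 for p in pages])                 # +1 so that range(1, n) reaches the last page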

data = pd.DataFrame(f)
data.columns = ['Links']
data.to_csv('AllSearchPages-' + today + '.csv', sep=',')

The final operations are to convert the list of links into a dataframe, name the column ‘Links’, and save it to a csv file. Here is what the csv output looks like:

CSV output with all search pages

I marked the city code and the page number in red to highlight that the number of search pages for Dubai is 1243, for Umm Al Quwain is 1, Ras Al Khaimah (34), Sharjah (68), Ajman (26), Abu Dhabi (325), Fujairah (1) and Al Ain (18).

In the next exercise, I will scrape the characteristics of each property listed for rent on Property Finder and save them into a csv file for further analysis so that we finally learn what the average rental price is in each city.

Stay tuned!

The code for this exercise can be downloaded on GitHub, together with the output. You’ll need to be logged in in order to see the files.
