Some time ago I published GenevaFlats, but I never took the time to write about how I came to build it in the first place. Today I’m fixing that.
The Motivation
When I first arrived in the region, I was just as frustrated as everyone else. I remember sharing my lamentations over lunch with a friend when he offered this enlightening anecdote: He was having coffee and scanning the newspaper¹ for apartments when, after venting his frustrations to the waiter, he received this pearl of wisdom…
“You shouldn’t be looking at the classified ads section. That’s where everyone is looking! Instead, you should look at the obituaries, so you know when a flat becomes available before anyone else does.”
It was not clear to me whether this was intended as a joke, particularly because I don’t think the address is commonly included in an obituary. But in any case, this story got me thinking that perhaps a different approach was required if I wanted to find a flat.
First take
While going through the common classified ads websites I came to realise that not all the properties shown on the real estate agents’ own websites were listed there. Why? I honestly have no idea, since logic would suggest agents would want to advertise their listings everywhere possible. Whatever the reason, I thought those “hidden” flats might be key to getting my own place.
But it annoyed me to go through all those websites daily. That’s when I thought: why spend hours browsing all those websites when I could spend days automating the whole thing instead? It also just so happened that I had read Data Science at the Command Line very recently, and I wanted to play with stringing together some good old curl, jq, awk, and the like to see how far this rabbit hole could take me. To my own surprise I managed to build an MVP that spat out a plain text file with all the new flats published that day. It wasn’t pretty, but it worked.
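A minimal sketch of that kind of pipeline, for flavour. The real script is long gone, so the endpoint URL and the JSON field names here are made up for illustration:

# Pull a listing feed, keep a few fields, drop anything above budget,
# and diff against yesterday’s snapshot (assumed to exist from a previous run).
curl -s 'https://agency.example/api/listings' \
  | jq -r '.listings[] | [.published, .rooms, .price, .url] | @tsv' \
  | awk -F'\t' '$3 <= 3500' \
  | sort > flats-today.txt

# Lines that weren’t there yesterday are newly published flats.
comm -13 flats-yesterday.txt flats-today.txt > new-flats.txt
mv flats-today.txt flats-yesterday.txt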
Making it usable
Indeed it worked so well that I did find a flat and forgot about this whole business for years, until a friend asked me for “the script that got me a flat in Geneva”. I’m ashamed to admit that I had more or less lost the original script: I had some bits and pieces, but nothing that worked. However, at this point I was playing with a new server and I thought it’d be fun to give it a workload that was actually useful, so I rewrote the whole thing with scrapy.
I started creating a Flask backend + ReactJS frontend to present the data as a website, but I quickly abandoned the idea when I realised that I could just generate the pages statically:
3 date filters (today, last week, last month)
x 5 room filters
x 8 price filters
x 3 sorting fields (date, number of rooms, price)
x 2 sorting orders (ascending or descending)
------------------------------------------------
720 files to be created
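As a back-of-the-envelope sketch, the generation boils down to a handful of nested loops. The slugs below are modelled on the site’s URLs but are purely illustrative, and render_page is a hypothetical placeholder for whatever produces each HTML page:

for date in yesterday last-week last-month; do
  for rooms in 1-rooms 2-rooms 3-rooms 4-rooms 5-rooms; do
    for price in 1000-chf 1500-chf 2000-chf 2500-chf 3000-chf 3500-chf 4000-chf 4500-chf; do
      for sorting in sort-by-date sort-by-rooms sort-by-price; do
        for order in asc desc; do
          dir="flats/$date/$rooms/$price/$sorting-$order"
          mkdir -p "$dir"
          # render_page is a placeholder for whatever produces the HTML
          render_page "$date" "$rooms" "$price" "$sorting" "$order" > "$dir/index.html"
        done
      done
    done
  done
done

Five levels of nesting look scary, but 3 x 5 x 8 x 3 x 2 works out to exactly the 720 directories tallied above.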
The size of each page obviously depends on the length of the list of flats, but even the largest ones came in at about 100K, so I knew I’d need less than 100K x 720 pages = 72MB of storage. In reality it takes only ~30MB. This approach has a number of advantages:
- All content is statically served. This also makes caching very easy since we know that each file will be regenerated after 24h. I make no money on this project so keeping the server bills low is important. Here’s the relevant nginx config:
location /flats {
    alias /var/www/flats;
    index index.html;

    # ... etc ...

    # Add cache control for index.html files
    location ~* /flats/(.*/)?index\.html$ {
        alias /var/www/flats/$1index.html;
        add_header Cache-Control "public, max-age=86400";
        expires modified +24h;
    }
}
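To check that the header actually comes out as intended, a HEAD request against any of the generated pages should do (the expected output below is my assumption of what a healthy response looks like):

# Ask only for the response headers and look for the cache directive
curl -sI 'https://www.mich0.com/flats/yesterday/1-rooms/3500-chf/sort-by-price-asc/' \
  | grep -i 'cache-control'
# Should print something like: cache-control: public, max-age=86400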
- No client-side rendering and no JS dependencies. As a matter of fact there’s barely any JS at all, only a small bit that redirects to the correct folder depending on the <select> values:
function getUrl() {
    // Build the path of the statically generated page that matches the filters
    const base_url = "https://www.mich0.com/flats";
    const price = document.getElementById('filter-price').value;
    const rooms = document.getElementById('filter-rooms').value;
    const date = document.getElementById('filter-date').value;
    const sorting = "sort-by-price-asc";  // sorting is hardcoded in this snippet
    return `${base_url}/${date}/${rooms}/${price}/${sorting}/`;
}

// Navigate to the matching page whenever any of the filters changes
document.querySelectorAll('[data-ui-type="filter"]').forEach(function (el) {
    el.addEventListener('change', function () {
        location.href = getUrl();
    });
});
- I get “pretty URLs” by default: https://www.mich0.com/flats/yesterday/1-rooms/3500-chf/sort-by-price-asc/ really just serves the file
/yesterday/1-rooms/3500-chf/sort-by-price-asc/index.html
Next Steps?
I’m quite satisfied with the project as it is and I’m not planning on adding more functionality. However, I think it’ll be interesting to come back after, say, a year or two, have a look at all the data I’ve gathered, and see what patterns we can find in the rental market. But that’ll be for another time!
¹ Yes, it was already unusual back then to look for flats in printed media rather than online.