May 22, 2023 by Bennett Williams
Using Karrots to Measure Airbnb Impact - Part 1
Inside Airbnb: Helping Understand Airbnb's Worldwide Impact
Inside Airbnb is an activist resource that compiles Airbnb rental data the company itself does not publish, in a raw form that researchers can use to understand how the business impacts neighborhoods around the world.
From the site:
How is Airbnb really being used in and affecting the neighbourhoods of your city?
Airbnb claims to be part of the “sharing economy” and disrupting the hotel industry. However, data shows that the majority of Airbnb listings in most cities are entire homes, many of which are rented all year round - disrupting housing and communities.
Lawmakers have even more to say on the subject:
“It demonstrates unequivocally that illegal short-term rentals on Airbnb have a significant negative impact on housing in New York City, taking affordable units off the market and driving up rents for everyone else.”
Liz Krueger, New York State Senator
The site is beautiful, and incredible given that it is a free resource; you should explore it! But what if you want to take a deeper dive into Airbnb's impact? There is really no other source of short-term rental data. Perhaps you want to combine it with other public data sources to run principal component analysis against transit, wages, and so on. You will need to ingest it.
Use Karrots to Deep-Dive the Inside Airbnb Dataset
Ingest
Ingesting data is the messiest part of most data projects. As an activist site, Inside Airbnb is not really set up to be a data warehouse, but it does provide the last twelve months of data in CSV form, across several categories, for dozens of cities. Downloading this data carelessly would burden the project and would likely violate the spirit of its data policy, so we want to be efficient about how we gather it.
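One way to stay efficient is to compute only the downloads we are missing before fetching anything. The sketch below assumes we keep a record of already-ingested URLs (as the source_urls table does later in this post); the base URL and path layout here are hypothetical, since Inside Airbnb's real paths include region and snapshot-date segments that vary per city.

```python
from urllib.parse import urljoin

# Hypothetical base URL and path layout; the real Inside Airbnb
# paths include region and snapshot-date segments per city.
BASE = "http://data.insideairbnb.com/"

def pending_downloads(cities, categories, already_imported):
    """Return only the CSV URLs we have not yet ingested."""
    urls = []
    for city in cities:
        for category in categories:
            url = urljoin(BASE, f"{city}/{category}.csv")
            if url not in already_imported:
                urls.append(url)
    return urls
```

Fetching only what is missing (and pausing between requests) keeps repeated pipeline runs from hammering the site with redundant downloads.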
Schema
For data analytics it's better to convert the imported CSV data to SQL before we move it to the data warehouse. For this we use the Karrots Python package, which integrates the Alembic schema manager and the SQLAlchemy SQL library with our zero-trust security model, so that we don't need to provide any long-lived passwords or access tokens.
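As a minimal sketch of the CSV-to-SQL step, pandas can load a CSV and write it to a database table through a SQLAlchemy engine. The columns and the in-memory SQLite database below are illustrative stand-ins, not the real Inside Airbnb schema or the warehouse:

```python
import io

import pandas as pd
from sqlalchemy import create_engine, text

# A tiny stand-in for one Inside Airbnb CSV; columns are made up.
csv_text = "listing_id,neighbourhood,price\n101,Harlem,120\n102,Astoria,95\n"
df = pd.read_csv(io.StringIO(csv_text))

# In the real pipeline the engine points at the warehouse, not SQLite.
engine = create_engine("sqlite:///:memory:")
df.to_sql("listings", engine, index=False, if_exists="replace")

with engine.connect() as conn:
    row_count = conn.execute(text("SELECT COUNT(*) FROM listings")).scalar()
```

The same pattern scales to the full dataset: read each downloaded CSV, then hand the rows to SQLAlchemy so the warehouse load is typed SQL rather than raw files.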
Alembic lets us define a schema in a metadata_model.py file:
from sqlalchemy import Column, TIMESTAMP, VARCHAR, text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class SourceUrl(Base):
    __tablename__ = 'source_urls'
    source_url = Column(VARCHAR(265), primary_key=True)
    imported_timestamp = Column(TIMESTAMP, nullable=False, server_default=text("CURRENT_TIMESTAMP"))
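In the pipeline this table is created by Alembic migrations, but as a quick sanity check the model can be exercised directly against an in-memory SQLite database (a sketch; create_all stands in for the migration here):

```python
from sqlalchemy import Column, TIMESTAMP, VARCHAR, create_engine, text
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class SourceUrl(Base):
    __tablename__ = 'source_urls'
    source_url = Column(VARCHAR(265), primary_key=True)
    imported_timestamp = Column(TIMESTAMP, nullable=False,
                                server_default=text("CURRENT_TIMESTAMP"))

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)  # stand-in for the Alembic migration

with Session(engine) as session:
    # Record one ingested file; the timestamp is filled by the server default.
    session.add(SourceUrl(source_url="http://example.com/listings.csv"))
    session.commit()
    recorded = session.query(SourceUrl).count()
```

Because source_url is the primary key, attempting to record the same file twice fails fast, which is exactly the dedupe behavior the ingest needs.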
Once we define the schema, the following code will check that the database tables match the current schema, and if not, will apply migrations to bring it up to date before we start the ingest:
@task.external_python(
    task_id="validate_db_schema",
    python=os.environ['AIRFLOW__CORE__DAGS_FOLDER'] + "/airbnb/venv/bin/python",
)
def validate_db_schema():
    # Imports live inside the task because it runs in a separate virtualenv.
    import os

    import alembic.command
    import alembic.config
    from importlib_resources import files
    from dagfunctions import functions

    logger = functions.setup_logging()
    engine = functions.create_db_engine(logger)

    # Locate the Alembic config bundled inside the dagfunctions package.
    alembic_base_path = files('dagfunctions').joinpath('schema')
    alembic_cfg = alembic.config.Config(os.path.join(alembic_base_path, 'alembic.ini'))
    alembic_cfg.set_main_option('script_location', os.path.join(alembic_base_path, 'alembic'))
    alembic_cfg.attributes['connection'] = engine

    # Apply any pending migrations so the tables match the current schema.
    alembic.command.upgrade(alembic_cfg, "head")
    return {}