May 22, 2023 by Bennett Williams

Using Karrots to Measure AirBnB Impact - Part 1


InsideAirBnb: Helping Understand AirBnb’s Worldwide Impact

Inside Airbnb is an activist resource that generates unpublished AirBnb rental data in a raw form that researchers can use to understand how that business impacts neighborhoods around the world.

From the site:

How is Airbnb really being used in and affecting the neighbourhoods of your city?

Airbnb claims to be part of the “sharing economy” and disrupting the hotel industry. However, data shows that the majority of Airbnb listings in most cities are entire homes, many of which are rented all year round - disrupting housing and communities.

Lawmaker have even more to say on the subject:

“It demonstrates unequivocally that illegal short-term rentals on Airbnb have a significant negative impact on housing in New York City, taking affordable units off the market and driving up rents for everyone else.”

Liz Kreuger, New York state senator

The site is both beautiful and incredible given that it is a free resource — you should explore it! But what if you want to take a deeper dive into AirBnb’s impact? There really is no other source of short-term data. Perhaps you want to combine it with other public data sources to do principal component analysis relative to transit, wages and so on. You will need to ingest it.

Use Karrots to Deep-Dive the Inside AirBnb Dataset


Ingesting data is the messiest part of most data projects. As an activist site, InsideAirBnb is not really setup to be a data warehouse, but they do provide the last twelve months of data in csv form across several categories for dozens of cities. Downloading this data carelessly will be a burden on the project and likely violates the spirit of their data policy. For this reason we want to be efficient about how we gather this data.


For data analytics it’s better to convert the import CSV data to SQL before we move it to the data warehouse. For this we use the Karrots python package which integrates Alembic schema manager and SqlAlchemy SQL library with our zero-trust security model so that we don’t need to provide any long-lived passwords or access tokens.
Alembic allows to define a schema in a file:

    class SourceUrl(Base):
        __tablename__ = 'source_urls'
        source_url = Column(VARCHAR(265), primary_key=True)
        imported_timestamp = Column(TIMESTAMP, nullable=False, server_default=text("CURRENT_TIMESTAMP"))

Once we define the schema, the following code will check that the database tables match the current schema, and if not, will apply migrations to bring it up-to-data before we start the ingest:

    @task.external_python(task_id="validate_db_schema", python=os.environ['AIRFLOW__CORE__DAGS_FOLDER'] + "/airbnb/venv/bin/python")
    def validate_db_schema():
        import os
        from pathlib import Path
        from importlib_resources import files

        from dagfunctions import functions
        import alembic.config
        import alembic.command

        logger = functions.setup_logging()

        engine = functions.create_db_engine(logger)

        # alembic_base_path = os.path.join(Path(os.path.abspath(__file__)).parent.parent, 'schema')
        alembic_base_path = files('dagfunctions').joinpath('schema')
        alembic_cfg = alembic.config.Config(os.path.join(alembic_base_path, 'alembic.ini'))
        alembic_cfg.set_main_option('script_location', os.path.join(alembic_base_path, 'alembic'))
        alembic_cfg.attributes['connection'] = engine
        alembic.command.upgrade(alembic_cfg, "head")

        return {}