Ibis compatibility checker

I'm a fan of the Ibis framework. To me, DataFrame APIs feel great, and leaving pure Python to write a SQL query in a nasty triple-quoted string does not.

In this post, I motivate and give a proof of concept for an Ibis "compatibility checker" where, given an Ibis expression, the checker will tell you which backends it can be run on.

This post isn't meant to sell Ibis, just motivate why I think it could be useful for a specific set of people.

Code for this post can be found here.

Motivation

Let's say I want to write a reusable feature engineering pipeline. Say, I have a good idea for building features for churn modeling. I want to roll this out to all my customers on their infrastructure. My customers all have different levels of maturity and (of course) different infrastructure.

Not an exhaustive list, but some options are:

  • Pandas for everything
  • Rely on PySpark
  • Write everything in SQL
  • Use a fancy tool like dbt

These all have positives and negatives, but my requirements are:

  • Pythonic API
  • No vendor lock in
  • Scalable to realistic datasets1
  • Few assumptions about infrastructure

Now, I'm definitely not a data engineer. If you are, you're likely scoffing and saying "tool X would solve all of this. What a fool." You're probably right! I'm no expert here.

However, you are the one reading this post! And I have been struck by how it seems like Ibis helps solve most of these points.

Ibis lets us assume we can at least get our data into a particular data model and then run things like feature engineering in a portable way. It won't matter if we have flat files or an OLAP database. We can write some code to fit the raw data into our model2 and then run our feature engineering or analytics or whatever.
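
To make that concrete, here's a minimal sketch (the file name and columns are made up, and you'd swap ibis.duckdb.connect() for whatever backend a given customer runs):

import ibis

# Connect to an in-memory DuckDB instance; swapping this one line for,
# say, ibis.postgres.connect(...) leaves the rest of the pipeline untouched.
con = ibis.duckdb.connect()

# Hypothetical raw data, already fit into our data model.
events = con.read_csv("churn_events.csv")

# Backend-agnostic feature engineering: one row of features per customer.
features = events.group_by("customer_id").agg(
    n_events=events.count(),
    last_seen=events.event_date.max(),
)

print(features.to_pandas().head())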

However, a question that I could never answer was: given an Ibis expression, which backends will it run on?

Ibis has a great documentation page here. It answers exactly this question, but you'd have to look up every single operation one by one. If you're trying to maximize compatibility, that quickly becomes cumbersome.

So, I set out to try to do this automatically.

Checking compatibility

Out of the box, each Backend class in Ibis has a has_operation method. This does some complex delegation out to a Backend's internals. What matters is that this code will return False:

from ibis.backends.duckdb import Backend
import ibis.expr.operations as ops

Backend().has_operation(ops.HashBytes)

and this code will return True:

from ibis.backends.duckdb import Backend
import ibis.expr.operations as ops

Backend().has_operation(ops.Filter)

because the duckdb backend does not support the HashBytes operation and does support the Filter operation.

Given this, it seems the job is done! We can just run this in a loop across an expression's operations and return which backends will play nicely.

We could get inspired by the code that builds the operation support matrix and call it a day!
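
Here's a minimal sketch of that loop, assuming a recent Ibis where expr.op().find(ops.Node) walks the whole operation graph, and assuming the backends being checked are importable (foreshadowing: that's the catch):

import ibis
import ibis.expr.operations as ops
from ibis.backends.duckdb import Backend as DuckDB
from ibis.backends.sqlite import Backend as SQLite

def op_types(expr):
    # Collect every distinct operation class in the expression's graph.
    return {type(node) for node in expr.op().find(ops.Node)}

def compatible_backends(expr, backends):
    names = []
    for name, backend_cls in backends.items():
        backend = backend_cls()
        # A backend is compatible iff it supports every operation used.
        if all(backend.has_operation(op) for op in op_types(expr)):
            names.append(name)
    return names

expr = ibis.literal(1) + 2
print(compatible_backends(expr, {"duckdb": DuckDB, "sqlite": SQLite}))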

Dependency hell

Of course, it's not that easy. At the time of writing, Ibis has 20 backends. In order to import all 20 backends, I have to install all Python packages and system-level dependencies for those backends.

The Ibis project offers plenty of options for developing with all these dependencies installed. However, none of them really suits the tool I'm trying to build.

To use a hypothetical "ibis compatibility checker", one would also have to have every single dependency installed. This renders the tool useless in my opinion.

The checker

The Ibis documentation gets rebuilt at some regular interval, which means the operation support matrix is refreshed on the same schedule. Why not just scrape this big table at runtime?

So that's what I did! You can see my implementation details here.

This way, the only requirements for the library are:

  • Ibis itself (with no backend dependencies)
  • Beautiful Soup
  • httpx

This isn't perfect as it relies on the docs having the same format in perpetuity. However, it's good enough for what I wanted.

Most importantly, we rely on the Ibis team to maintain their developer environment and build the docs for us, which minimizes our dependencies.
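
For flavor, here's a minimal sketch of the scraping step (the URL and the table's markup are assumptions about the current layout of the rendered docs page, not a stable API):

import httpx
from bs4 import BeautifulSoup

# Assumed location of the rendered support matrix; if the docs move or
# change shape, this is exactly the part that breaks.
MATRIX_URL = "https://ibis-project.org/support_matrix"

def fetch_support_matrix(url=MATRIX_URL):
    html = httpx.get(url, follow_redirects=True).text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")

    # First header cell names the operation column; the rest are backends.
    header = [th.get_text(strip=True) for th in table.find_all("th")]
    backends = header[1:]
    supported = {name: set() for name in backends}

    for row in table.find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) != len(backends) + 1:
            continue
        op_name, marks = cells[0], cells[1:]
        for backend, mark in zip(backends, marks):
            # Assumes supported cells render as a checkmark.
            if "✔" in mark:
                supported[backend].add(op_name)

    return supported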

Examples

Finally, we can see some examples of backend checks on simple expressions. These also happen to be unit tests in the package code.

For more on installation, see the repository.

import ibis
from ibis import _
from ibis_compatibility import Checker

checker = Checker()
expr = ibis.literal(1) + 2

print(checker.compatible_backends(expr).backends)

This code will output all backends since all backends support literals.

A more interesting example is:

t = ibis.examples.penguins.fetch()

expr = t.bill_length_mm.argmin(t.species)

result = checker.compatible_backends(expr)

print(result.backends)

This will output (formatted for clarity):

[
    "athena",
    "bigquery",
    "clickhouse",
    "databricks",
    "datafusion",
    "duckdb",
    "postgres",
    "pyspark",
    "risingwave",
    "snowflake",
    "sqlite",
    "trino",
]

This is helpful. But for mere mortals who don't have all the backends memorized, we might also want to know which backends have been excluded.

We can see this and which operations kicked the backends out with:

print(result.restricted_operations)

which will output (again formatted):

{
    "ArgMin": [
        "druid",
        "exasol",
        "flink",
        "impala",
        "mssql",
        "mysql",
        "oracle",
    ],
    "DatabaseTable": ["polars"],
}

Now we know that if we want to continue supporting mysql, we should rethink our use of ArgMin.
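
With a result like this in hand, gating a pipeline on a set of must-support backends takes only a few lines; a usage sketch (the required set is, of course, up to you):

required = {"mysql", "duckdb", "postgres"}

# Fail fast if the expression won't run on every backend we must support.
missing = required - set(result.backends)
if missing:
    raise RuntimeError(f"Incompatible with required backends: {sorted(missing)}")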


  1. For more on what a realistic dataset size is for enterprise use cases, see here.

  2. Certainly this is easier said than done.