Design philosophy

FeatureQL looks like SQL. It borrows SQL's keywords, its expression syntax, and its general shape. But it is not SQL. It is a different kind of language that uses familiar syntax to solve a different problem, and some of its design decisions only make sense once you understand the goals behind them.

This page explains what FeatureQL is trying to be, what it is explicitly not, and why certain choices were made.

When to use FeatureQL, when to use SQL

FeatureQL is not a replacement for SQL. It is a complement, designed for a specific layer of the data stack.

Use SQL for data cleaning, conforming, and exploration. When you are in the silver layer, wrangling raw data into shape, SQL is the right tool. When you are exploring an unfamiliar dataset (SELECT * FROM table LIMIT 100), SQL is faster and more direct. When you need a one-off aggregation you will run once and discard, SQL has less overhead.

Use FeatureQL when business logic matters enough to name, share, and reuse. This is the gold layer: customer segments, pricing rules, risk scores, KPIs. Logic that multiple people depend on, that needs to stay consistent across dashboards and APIs, that changes over time and must be auditable. This is where FeatureQL pays off.

The transition point is clear: the moment you find yourself copying a SQL calculation into a second query, that calculation should probably be a feature.

Think in entities, not in tables

SQL trains you to think in rows flowing through a pipeline: scan a table, filter rows, join more rows, group and aggregate. FeatureQL asks you to think differently: start with an entity (a customer, an order, a product), ask what you need to know about it, and express that as named features with explicit dependencies.

This is closer to how business people already think. "What is this customer's lifetime value?" is a question about an entity, not about a table. The answer depends on other things you know about the customer (their orders, their payments), which in turn depend on things you know about those orders (amounts, dates). FeatureQL lets you express this chain of reasoning directly, and the engine figures out the data fetching, joining, and aggregation.

If you have used spreadsheet formulas, the model will feel familiar. A feature is like a named cell formula: it takes inputs, references other named formulas, and produces a value. The "table" you see in query results is a presentation layer, not the core abstraction.

-- SQL thinking: "join these tables, filter, aggregate"
SELECT c.id, SUM(o.amount)
FROM customers c JOIN orders o ON c.id = o.customer_id
WHERE o.date > '2024-01-01'
GROUP BY c.id

-- FeatureQL thinking: "for a customer, what is their recent revenue?"
RECENT_REVENUE := CUSTOMER_ID.RELATED(
    SUM(ORDER_AMOUNT) GROUP BY ORDER_CUSTOMER_ID
)

The FeatureQL version reads like a business rule: "a customer's recent revenue is the sum of their order amounts, grouped by the customer key on the orders." The SQL version reads like database instructions. Setting aside the date filter in the SQL example, both express the same aggregation; FeatureQL makes the intent visible.

Entity-based reasoning in practice

The mental model shift shows up most clearly when working with related data across entities. In SQL, you think: "I need to JOIN orders to products to get categories, then JOIN back to customers to aggregate." In FeatureQL, you think in functional hops:

  1. I have a customer. They have a list of order IDs (an array).
  2. I enrich those order IDs with product categories (EXTEND: add data from another entity).
  3. I filter and aggregate the enriched list (TRANSFORM: compute within the array).

Each step is a clear operation: fetch related data at the graph level, then compute within the array. Once this pattern clicks (usually within a couple of weeks of daily use), it becomes more intuitive than SQL joins because each step maps to a business concept rather than a database operation.
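Concretely, the three hops above might look like the sketch below. The feature names are invented for illustration, and the exact EXTEND and TRANSFORM signatures shown here are a sketch of the shape, not canonical syntax:

-- Step 2: enrich the customer's order IDs with product data (graph level)
-- CUSTOMER_ORDER_IDS and PRODUCT_CATEGORY are hypothetical feature names
ENRICHED_ORDERS := EXTEND(CUSTOMER_ORDER_IDS WITH PRODUCT_CATEGORY),

-- Step 3: filter and aggregate inside the enriched array (transform level)
ELECTRONICS_REVENUE := TRANSFORM(ENRICHED_ORDERS AS
    SELECT SUM(ORDER_AMOUNT) WHERE PRODUCT_CATEGORY = 'electronics'
)

Whatever the precise syntax, the layering is the point: EXTEND does the cross-entity fetch at the graph level, and TRANSFORM never leaves the array.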

A pure functional language in SQL clothing

Every feature in FeatureQL is a pure function: given the same inputs, it always returns the same output. There is no mutable state. There are no side effects within the feature graph. Features cannot read from external systems at evaluation time, modify data, or behave differently depending on when they run.

This purity is not accidental. It is the property that makes everything else work.

Because features are pure, the engine can evaluate them in any order, cache results freely, and guarantee that a feature produces the same value whether it runs in a DuckDB notebook, a Trino warehouse, or a DataFusion serving endpoint. Because features are pure, VARIANT() is safe: replacing a dependency produces a well-defined new computation with no hidden interactions. Because features are pure, the registry can validate consistency statically without executing anything.

Where does non-determinism go?

Real systems need non-deterministic values: the current date, random samples, external API responses. FeatureQL handles these at two distinct layers.

Compile-time metaprogramming (@literal(), @echo()) evaluates expressions before the feature graph runs, injecting the results as constants. @literal(NOW()) does not give you "the current time per row." It gives you "the time at which the query was compiled," frozen as a literal value that every row sees identically. This is not a limitation; it is a deliberate separation between "what varies per evaluation" and "what varies per row." SQL conflates the two, and the result is subtle bugs when NOW() behaves differently across backends or execution plans.
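For example, a "days since signup" feature can freeze the evaluation time as a compile-time constant. SIGNUP_DATE and DATE_DIFF are illustrative names here, not guaranteed built-ins:

WITH
    -- Evaluated once at compile time; every row sees the same timestamp
    AS_OF := @literal(NOW()),
    DAYS_SINCE_SIGNUP := DATE_DIFF('day', SIGNUP_DATE, AS_OF)
SELECT
    DAYS_SINCE_SIGNUP

Because AS_OF is injected as a literal, the feature stays pure: re-running the compiled query yields the same result on every backend.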

External data sources (EXTERNAL_COLUMNS, EXTERNAL_REDIS, EXTERNAL_HTTP) are the boundary where the outside world enters the feature graph. They are declared explicitly, their types are specified at the boundary, and their results are treated as inputs to the pure graph, not as computations within it. Once data crosses this boundary, everything downstream is deterministic.

This two-layer separation means you can always answer the question: "If I run this feature again with the same inputs and the same source data, will I get the same result?" The answer is always yes.

Hybrid SQL is scaffolding, not part of the language

FeatureQL queries can embed raw SQL using /* SQL */ blocks. This is useful for setup (creating tables, loading data) and for incremental adoption (wrapping FeatureQL inside existing SQL workflows).

But SQL blocks cannot be persisted in the feature registry. They exist only at query time. This is a deliberate constraint: the registry is the system of record for business logic, and everything in it is pure FeatureQL with full type safety, dependency tracking, and cross-backend portability. SQL blocks are ephemeral scaffolding, not part of the permanent feature graph.

Think of it like a REPL session: you can run arbitrary commands to set up your environment, but only the feature definitions get saved.
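A minimal sketch of the shape (the table and feature names are invented, and the exact placement of the /* SQL */ delimiter is illustrative):

/* SQL */ CREATE TABLE tmp_orders AS SELECT * FROM raw.orders WHERE amount > 0;

WITH
    ORDER_AMOUNT := INPUT(DECIMAL)  -- pure FeatureQL: this part can be persisted
SELECT
    ORDER_AMOUNT * 1.1

The CREATE TABLE line runs as ordinary SQL at query time and leaves no trace in the registry; only the FeatureQL feature definitions are eligible for persistence.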

Strict types, permissive syntax

FeatureQL enforces two rules that may surprise SQL users.

No implicit type coercion

FeatureQL transpiles to multiple backends (DuckDB, Trino, BigQuery, DataFusion), and each handles implicit coercion differently. A query that silently widens BIGINT to DOUBLE in DuckDB might truncate precision in BigQuery. By requiring explicit casts, FeatureQL guarantees that your feature produces the same result everywhere, not just on the backend you happened to test on.

In practice, this is less painful than it sounds, because literal integers adapt their type automatically when combined with a typed feature:

WITH
    PRICE := 10.25   -- DECIMAL(4,2)
SELECT
    PRICE + 5,       -- Works: literal 5 adapts to DECIMAL
    5 + PRICE,       -- Works: same adaptation regardless of order
    PRICE + 5 + 3,   -- Works: adapts through the chain
    (PRICE + 5) + 3  -- Works: same thing with explicit grouping

The literal 5 is not a fixed BIGINT; it takes on whatever type is needed by context. This covers the vast majority of business logic, where you are combining features with constants.

The one case that requires explicit casting is combining two computed features of different types. FEATURE_A + FEATURE_B where A is DECIMAL and B is BIGINT will fail, because neither is a literal that can adapt. This is the case where the compiler asks you to be explicit, and it is the case where ambiguity actually matters.

The rule to remember: literal constants adapt, computed values don't. If the compiler asks for a cast, it is because two different computations produced different types and the right resolution is genuinely ambiguous.
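Both sides of the rule in one sketch, assuming a SQL-style CAST and an invented QUANTITY feature:

WITH
    PRICE := 10.25,               -- DECIMAL(4,2)
    QUANTITY := INPUT(BIGINT)
SELECT
    PRICE + 5,                                -- OK: literal adapts to DECIMAL
    PRICE + CAST(QUANTITY AS DECIMAL(10, 2))  -- two computed types: cast explicitly

Writing PRICE + QUANTITY without the cast would be rejected at compile time, because neither operand is a literal that can adapt.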

Edge case

An expression involving only literals produces a computed result, not a literal. 10 + 5 yields a computed BIGINT, which will not adapt. If you need a computed constant to adapt, evaluate it at compile time with @literal(10 + 5), which produces a literal that adapts normally.

Multiple syntax forms

You can write 1 AS NAME, NAME := 1, or NAME IS 1. You can use ARRAY or LIST, ROW or STRUCT, VIA or ON. This is intentional. Data practitioners come from different SQL dialects, and forcing everyone into one syntax creates unnecessary friction during adoption. Use the form you already know.
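All of the following define the same feature; pick whichever reads naturally to you:

TAX_RATE := 0.21
0.21 AS TAX_RATE
TAX_RATE IS 0.21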

For teams that want consistency, FeatureQL supports a strict mode that normalizes all queries to a canonical syntax automatically. The registry also normalizes stored features, so persisted definitions are always consistent regardless of which syntax form the author used.

The philosophy is: be strict about semantics (types, dependencies, purity), be flexible about surface syntax.

Two levels of scope, nothing more

SQL queries can nest arbitrarily: subqueries inside subqueries inside CTEs inside views. This nesting is one of the main reasons complex SQL is hard to read, hard to debug, and hard to maintain.

FeatureQL has exactly two scope levels.

The feature graph is the main scope. Features reference other features, the engine resolves the dependency order, and everything evaluates in a flat DAG. There is no nesting. A feature that depends on another feature is not "inside" it; they are peers in the graph.

TRANSFORM inner queries are the second scope. When you operate on an array of rows (filtering, aggregating, sorting), the inner query runs in its own scope with access only to the fields inside that array. It cannot reference features from the outer graph, trigger external data fetches, or nest another TRANSFORM.

This restriction is deliberate, not a limitation. The inner query is pure and local: it sees only the data in the array, transforms it, and returns the result. If you need data from outside the array (another entity, an external source), you enrich the array first using EXTEND() at the graph level, then transform the enriched result. The pattern is always: fetch and join in the graph scope, compute in the transform scope.

This two-level model means you never need to trace through nested scopes to understand what a feature does. The graph level tells you where data comes from and how entities relate. The transform level tells you how array data is processed. Nothing else.

IMPORTING and the scope boundary

TRANSFORM supports an IMPORTING clause that brings values from the outer scope into the inner query. These are materialized scalar values at the point of import, not lazy references to graph nodes. Importing a feature does not trigger any computation or data fetch; it copies an already computed value into the inner scope. The purity of the inner query is preserved.
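A sketch of the pattern; the feature names are invented and the exact placement of the IMPORTING clause within TRANSFORM is illustrative:

-- AVG_ORDER_VALUE is computed once in the graph scope, then copied in
-- as a scalar constant visible to the inner query
BIG_ORDER_COUNT := TRANSFORM(CUSTOMER_ORDERS IMPORTING AVG_ORDER_VALUE AS
    SELECT COUNT(*) WHERE ORDER_AMOUNT > AVG_ORDER_VALUE
)

Inside the inner query, AVG_ORDER_VALUE is just a value; nothing in the inner scope can reach back into the graph or trigger computation.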

Entity annotations: types with meaning

FeatureQL extends its type system with entity annotations: BIGINT#CUSTOMERS means "a BIGINT that identifies a customer." This annotation is not a comment. It is part of the type, checked by the compiler, and used by operations like RELATED() and EXTEND() to resolve joins automatically.

Entity annotations serve three purposes.

They prevent join errors. You cannot accidentally join a customer ID to an order ID, even though both are BIGINT. The entity annotation distinguishes them at the type level.

They enable automatic relationship resolution. When you write RELATED(ORDER_CITY VIA LAST_ORDER_ID), FeatureQL knows that LAST_ORDER_ID is BIGINT#ORDERS and that ORDER_CITY depends on ORDER_ID, which is also BIGINT#ORDERS. The join path is unambiguous because the types encode the entity relationships. Unlike DAX or traditional semantic layers, nothing in RELATED is implicit: the foreign key is specified explicitly, and the entity's features unambiguously define the primary key of the entity being joined. This avoids the multiple-join-path ambiguities that plague systems with implicit relationship resolution.

They document the data model. Reading CUSTOMER_ID := INPUT(BIGINT#CUSTOMERS) tells you immediately what this feature represents, without checking a schema diagram or a wiki page.

Where annotation weight falls

Entity annotations are mainly declared at the data mapping boundary: when connecting FeatureQL to external tables via EXTERNAL_COLUMNS or INLINE_COLUMNS. This is typically the work of data engineering teams who define the core entities and their keys once. Business users who build features on top of these mappings rarely need to write entity annotations themselves; they just use features that already carry them.

If you are writing a quick exploratory query with inline data, annotations are optional. If you are building features for shared use in the registry, annotations are the mechanism that makes RELATED() and EXTEND() work correctly.
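The division of labor looks roughly like this (the feature names are invented for illustration):

-- Declared once at the mapping boundary (data engineering)
CUSTOMER_ID   := INPUT(BIGINT#CUSTOMERS),
LAST_ORDER_ID := INPUT(BIGINT#ORDERS),

-- Built on top (business users): the annotation resolves the join
LAST_ORDER_CITY := RELATED(ORDER_CITY VIA LAST_ORDER_ID)

The business-user line never restates an entity annotation; LAST_ORDER_ID already carries BIGINT#ORDERS, so RELATED() knows where to go.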

Features are public by name, private by omission

FeatureQL does not have PUBLIC or PRIVATE keywords. Instead, visibility follows from naming.

When you persist features with CREATE FEATURES AS, every named feature becomes part of the registry and is addressable by other users. Intermediate computations that are not given a name are persisted as part of the dependency graph but cannot be referenced directly. They are structurally inaccessible: no one can build on them, create variants of them, or depend on them.

This means the act of naming a feature is the act of making it public. If you want an internal computation to remain an implementation detail, simply do not name it in your CREATE FEATURES statement. Consumers can see that unnamed dependencies exist (via SHOW CREATE FEATURES), but they cannot address or replace them. The implementation is auditable but not programmatically coupled to.

This gives feature authors control over their API surface without requiring access control syntax: name what you want others to use, leave the rest unnamed.
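A sketch of the idea, assuming the CREATE FEATURES AS form described above (feature and input names invented):

CREATE FEATURES AS
WITH
    -- Public: named, so consumers can reference it and build variants
    RISK_SCORE := (ORDER_COUNT * 0.3) + (DAYS_INACTIVE * 0.7)
SELECT
    RISK_SCORE

The weighted sub-expressions inside RISK_SCORE are persisted as part of its definition, but because they are unnamed, no one can reference or replace them. Naming one (say, WEIGHTED_RECENCY := DAYS_INACTIVE * 0.7) would make it part of the public surface.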

Binding: how inputs become rows

The FOR clause connects the feature graph to actual values. While FeatureQL offers several binding functions, you only need two for most work.

The two binding modes you need

BIND_VALUES evaluates features for a list of specific values. Use this when you know exactly which inputs to evaluate:

FOR CUSTOMER_ID := BIND_VALUES(ARRAY[100, 101, 102])
-- Produces 3 rows: one per customer

BIND_COLUMNS evaluates features for every value in a table column. Use this when you want to evaluate across an entire dataset:

FOR CUSTOMER_ID := BIND_COLUMNS(customer_id FROM TABLE(analytics.dim_customers))
-- Produces one row per customer in the table

These two modes cover the vast majority of queries. Other binding functions (BIND_VALUE for a single value, BIND_SQL for a SQL subquery, @BIND_KEYSET for predefined entity sets) exist for specialized cases, but you can ignore them until you need them.

Pairing multiple inputs

When binding multiple inputs together, wrap them in ROW tuples to pair values one to one:

FOR (INPUT1, INPUT2) := BIND_VALUES(ARRAY[ROW(1, 10), ROW(2, 20), ROW(3, 30)])
-- Produces 3 rows: (1, 10), (2, 20), (3, 30)

If you bind multiple inputs with separate BIND_VALUES calls instead, FeatureQL produces the cross product of all combinations:

FOR
    INPUT1 := BIND_VALUES(ARRAY[1, 2, 3]),
    INPUT2 := BIND_VALUES(ARRAY[10, 20])
-- Produces 6 rows: every combination of INPUT1 × INPUT2

The cross product behavior is intentional and useful for evaluating features across all combinations of two dimensions. If you find yourself with more rows than expected, check whether you meant to pair values (use ROW tuples) or cross them (use separate bindings).

Working with namespaces

Persisted features live in namespaces (FM.FINANCE.PROJECTS.REVENUE). A query that uses several features from the same namespace can get verbose. Two mechanisms keep things concise.

Default namespace with FROM: Set a default namespace for the query so features can be referenced by their short name:

SELECT
    CUSTOMER_ID,
    CUSTOMER_SEGMENT
FROM FM.ECOMM

Namespace aliases with underscore prefix: When pulling features from multiple namespaces, use aliases to shorten references:

SELECT
    _e.CUSTOMER_ID,
    _b.INVOICE_TOTAL
FROM FM.ECOMM AS _e, FM.BILLING AS _b

The underscore prefix convention distinguishes namespace aliases from feature names, keeping queries readable without ambiguity.

Composition: build on what exists

The core workflow in FeatureQL is not writing features from scratch. It is composing existing features into new ones.

Reference any published feature by name. If someone has published CUSTOMER_LIFETIME_VALUE to the registry, you use it directly. You do not copy its logic, you do not need to know how it is computed, and you do not need to set up the data sources it depends on. All of that is resolved automatically from the registry.

Override any named dependency locally. If a feature does almost what you need but one of its dependencies should be different, use VARIANT() to replace that dependency in your query without modifying the original:

-- Use the standard customer segment, but with a different activity threshold
MY_SEGMENT := VARIANT(CUSTOMER_SEGMENT REPLACING AD_THRESHOLD WITH 30)

The original CUSTOMER_SEGMENT is unchanged. Your variant shares the entire dependency graph except for the one piece you replaced. This is the mechanism for what-if analysis, A/B testing, and client-specific customization. Only named dependencies can be replaced; unnamed intermediate computations (implementation details) are not addressable.

Mix FeatureQL with SQL freely at query time. For parts of your query where SQL is more natural (setup, exploration, one-off aggregations), embed raw SQL alongside FeatureQL. The SQL parts are query-time scaffolding; only the FeatureQL feature definitions can be persisted to the registry.

This means adoption is gradual. You do not need to rewrite your existing SQL. Start by defining one feature that encapsulates a calculation you reuse, then build from there.

Designed for teams, optional for individuals

FeatureQL is built for the case where multiple people need to share, reuse, and trust the same business logic. If you are one person doing your own analytics, SQL is fine.

The language makes a deliberate split between two roles.

Data engineering teams map external data to features: they declare entities, define key relationships, set up source connections, and publish core features to the registry. This work involves entity annotations, EXTERNAL_COLUMNS, and CREATE FEATURES statements. It is schema work, done once and shared broadly.

Business users build on top of published features: they write queries using features that already exist, define local features for their analysis, and use VARIANT() to test alternatives. They rarely need to think about data sources, join paths, or entity annotations, because that plumbing is already done.

This separation is why some parts of FeatureQL (entity declarations, source mappings, type annotations) feel heavier than equivalent SQL. They are designed for the infrastructure layer, where precision and explicitness pay off. The consumption layer, where most queries happen, is lighter: reference a feature by name, bind inputs, get results.

Discoverability

A language built on composition requires strong discoverability: you need to know what exists before you can build on it.

FeatureQL provides exploration functions directly in the query environment:

  • SHOW FEATURES — Browse the registry, filter by namespace, type, or dependency
  • SHOW CREATE FEATURES — See the exact definition of any feature
  • SHOW FUNCTION SIGNATURES — Get signatures and documentation for built-in functions
  • SHOW DOCS — Browse documentation from inside a query

Dependency exploration is also built in. You can query the feature graph itself to understand what depends on what, trace upstream dependencies, and assess the impact of changing a feature. See Explore features for the full set of exploration capabilities.

This means the registry is not just a store of feature definitions; it is a live, queryable catalog where lineage, documentation, and code are the same thing, not parallel artifacts that drift apart.

Last update at: 2026/03/03 16:47:38