LLMs modeling

Core Philosophy

FeatureQL is a functional feature engineering language based on Entity-Centric Modeling (ECM). Features are pure functions that compose automatically through dependency tracking. When persisted through CREATE FEATURES AS SELECT ..., only definitions are stored, not data (equivalent to a SQL view at the column level).

Mental Model: From Problem to FeatureQL

Step 1: Identify Entities and Their Grain

  • List all entities (CUSTOMER, PRODUCT, ORDER, etc.)
  • Each entity needs one INPUT() that represents its primary key
  • This INPUT defines the "grain" at which features are computed

Step 2: Map External Data to Entities

  • Create one EXTERNAL_SQL per entity and origin table. EXTERNAL_SQL first argument must use the dialect of the backend database you are using.

  • For dimension tables:

EXTERNAL_SQL(
  'SELECT customer_email, customer_country, CAST(customer_date as DATE) as customer_created_at FROM customers', -- Note the optional casting and alias to match the mapped feature
  ON `SELF.customer_id = %CUSTOMER_ID`, -- Note the binding of the primary key column to the INPUT feature `CUSTOMER_ID` representing the entity `CUSTOMERS`
  AS ROW(customer_email VARCHAR, customer_country VARCHAR#COUNTRIES, customer_created_at DATE)
),
null
  • For fact tables, use the primary key as the grouping key and use ARRAY_AGG to create arrays of related records

  • Keep EXTERNAL_SQL queries simple:

    • SELECT with GROUP BY, no filters, no complex logic
    • Filters and business logic belong in features, not in SQL in the EXTERNAL_SQL clause

Step 3: Define Filters as Features

  • Never put WHERE clauses in EXTERNAL_SQL
  • Create boolean features for filters (IS_JULY_2017_ORDER, IS_REFERENCE_PRODUCT)
  • These features can be reused across queries

Step 4: Trace the Query Path

  • Determine the traversal: which entities do you need to walk through?
  • Example: PRODUCT → ORDERS → CUSTOMERS → ORDERS → PRODUCTS
  • At each step, you'll use EXTEND to enrich with related entity data

Step 5: Build Features by Composing

  • Features with the same INPUT automatically align (no join needed)
  • Use EXTEND() to enrich arrays with data from related entities
  • Use TRANSFORM() to filter, aggregate, or compute on arrays
  • Access nested fields with bracket notation: ORDERS[amount]

Key Patterns

Pattern 1: Generic, Reusable Base Features

-- Define entities (always generic)
CUSTOMERS := ENTITY(),
PRODUCTS := ENTITY(),

-- Define inputs (always generic)
CUSTOMER_ID := INPUT(BIGINT#CUSTOMERS),
PRODUCT_ID := INPUT(BIGINT#PRODUCTS),

-- External data (generic, no filters)
CUSTOMER_DETAILS := EXTERNAL_SQL(
  'SELECT customer_id, name, email FROM customers',
  AS ROW(customer_id BIGINT, name VARCHAR, email VARCHAR)
),
CUSTOMER_NAME := CUSTOMER_DETAILS[name],

-- Grouped relationships (generic)
CUSTOMER_ORDER_IDS := EXTERNAL_SQL(
  `SELECT customer_id, ARRAY_AGG(order_id) as order_ids
   FROM orders GROUP BY customer_id`,
  AS ROW(customer_id BIGINT, order_ids ARRAY(BIGINT#ORDERS))
)[order_ids],
sql

Pattern 2: Filters as Features

If the filter is complex, you can define it as a boolean feature. If the filter is simple, you can use it directly in the query.

-- At ORDER level
ORDER_DATE := EXTERNAL_SQL(...)[order_date],
IS_JULY_2017 := ORDER_DATE >= '2017-07-01' AND ORDER_DATE < '2017-08-01',

-- At PRODUCT level
PRODUCT_NAME := EXTERNAL_SQL(...)[product_name],
IS_REFERENCE_PRODUCT := PRODUCT_NAME = 'Target Product',
sql

Pattern 5: Array of rows handling

-- ORDER_ITEMS is already ARRAY(ROW(product_id, quantity))
-- Access field directly:
PRODUCT_IDS := ORDER_ITEMS[product_id],

-- If nested deeper:
-- ORDERS has array, each order has ORDER_ITEMS which has array
ALL_PRODUCT_IDS := CUSTOMER_ORDERS[ORDER_ITEMS][product_id],
sql

Pattern 3: Enriching Arrays with EXTEND

-- CUSTOMER has array of order_ids called CUSTOMER_ORDER_IDS of type ARRAY(BIGINT#ORDERS)
-- Enrich with order-level features (automatic join on order_id)
CUSTOMER_ORDERS_ENRICHED := EXTEND(
  ZIP(CUSTOMER_ORDER_IDS as order_id)  -- Transform array into array of rows with an order_id field
  WITH ORDER_DATE, ORDER_AMOUNT, PRODUCT_IDS  -- Features at ORDER level
  VIA order_id BOUND TO ORDER_ID   -- Join condition (optional as order_id is already annotated with ORDER_ID)
),

-- If you already have an array of rows, no ROW() needed:
-- ORDER_ITEMS is ARRAY(ROW(product_id, quantity))
ORDER_ITEMS_ENRICHED := EXTEND(
  ORDER_ITEMS
  WITH PRODUCT_NAME, PRODUCT_PRICE  -- Features at PRODUCT level
  VIA product_id BOUND TO PRODUCT_ID
),
sql

Pattern 4: Operating on Arrays with TRANSFORM

Considering CUSTOMER_ORDERS_ENRICHED of type ARRAY(ROW(order_id BIGINT#ORDERS, order_date DATE, order_amount DECIMAL, product_ids ARRAY(BIGINT#PRODUCTS)))

-- Filter array
JULY_ORDERS := CUSTOMER_ORDERS_ENRICHED.TRANSFORM(
  SELECT * WHERE ORDER_DATE BETWEEN '2017-07-01' AND '2017-07-31'
),

-- Aggregate on all rows
TOTAL_SPENT := CUSTOMER_ORDERS_ENRICHED.TRANSFORM(
  SELECT SUM(ORDER_AMOUNT)
),

-- Check existence of a product in the customer's order items
HAS_BOUGHT_REFERENCE := CUSTOMER_ITEMS_ENRICHED.TRANSFORM(
  SELECT BOOL_OR(CONTAINS(PRODUCT_IDS, PRODUCT_ID))
),

-- Aggregate by any field or derived field
TOTAL_SPENT_BY_MONTH := CUSTOMER_ORDERS_ENRICHED.TRANSFORM(
  WITH ORDER_MONTH := EXTRACT(MONTH FROM ORDER_DATE)
  SELECT
    ORDER_MONTH,
    -- Note the GROUP BY as part of the feature definition (contrarily to SQL where the GROUP BY is part of the query)
    COUNT(1) GROUP BY ORDER_MONTH AS total_orders_by_month,
    SUM(ORDER_AMOUNT) GROUP BY ORDER_MONTH AS total_spent_by_month
  ORDER BY ORDER_MONTH ASC
),

-- Get distinct values
UNIQUE_PRODUCT_IDS := CUSTOMER_ITEMS_ENRICHED.TRANSFORM(
  SELECT DISTINCT order_customer_id
),
sql

Common Mistakes to Avoid

DO NOT: Put filters in EXTERNAL_SQL

-- WRONG
JULY_ORDERS := EXTERNAL_SQL(
  'SELECT * FROM orders WHERE order_date >= ''2017-07-01''',
  ...
)
sql

DO: Filter with features

-- RIGHT
ALL_ORDERS := EXTERNAL_SQL('SELECT * FROM orders', ...),
JULY_ORDERS := ALL_ORDERS.TRANSFORM(SELECT * WHERE ORDER_DATE >= '2017-07-01'),
sql

DO NOT: Use ROW() when you already have array of rows

-- WRONG: ORDER_ITEMS is already ARRAY(ROW(...))
EXTEND(
  ROW(product_id := ORDER_ITEMS)  -- Unnecessary ROW()
  WITH PRODUCT_NAME
  VIA product_id BOUND TO PRODUCT_ID
)
sql

DO: Use ROW() only for scalar foreign keys

-- RIGHT: Converting scalar array to array of rows
EXTEND(
  ROW(order_id := CUSTOMER_ORDER_IDS)  -- CUSTOMER_ORDER_IDS is ARRAY(BIGINT)
  WITH ORDER_DATE
  VIA order_id BOUND TO ORDER_ID
)

-- RIGHT: Already array of rows
EXTEND(
  ORDER_ITEMS  -- ORDER_ITEMS is ARRAY(ROW(product_id, quantity))
  WITH PRODUCT_NAME
  VIA product_id BOUND TO PRODUCT_ID
)
sql

DO NOT: Overcomplicate entity traversal

-- If you need to count something at PRODUCT level based on CUSTOMER qualification:
-- Don't create intermediate flattening/unnesting steps
sql

DO: Keep it simple with direct enrichment

-- Get customers per product, enrich with customer features, aggregate
PRODUCT_CUSTOMER_IDS := EXTERNAL_SQL(...)  -- Array of customer_ids
CUSTOMERS_ENRICHED := EXTEND(ROW(...) WITH HAS_QUALIFICATION),
COUNT := CUSTOMERS_ENRICHED.TRANSFORM(SELECT COUNT(*) WHERE HAS_QUALIFICATION),
sql

Problem-Solving Workflow

  1. Identify the target entity (what grain is your final answer?)
  2. List all entities in the path (what do you need to traverse?)
  3. Create generic base features (ENTITY, INPUT, EXTERNAL_SQL with no filters)
  4. Define filter features (boolean conditions at appropriate entity level)
  5. Trace forward from filters (enrich arrays, transform, enrich again)
  6. Aggregate at target entity (final TRANSFORM to compute answer)

Type Annotations

  • Use BIGINT#ENTITY_NAME to annotate foreign keys
  • This enables automatic validation and binding in EXTEND
  • Example: ARRAY(BIGINT#CUSTOMERS) for array of customer IDs

Understand feature engineering flow VS analytics flow

In feature engineering, we are building features that describe the characteristics of a known entity (a customer, a product, an order, etc.) represented by a primary key (CUSTOMER_ID, PRODUCT_ID, ORDER_ID, etc.).

In analytics, the keys to bind on are defined as the set of keys present in the dataset. The keys internal to the query are less strict and can be defined on the fly (revenue per month, category grouping, top countries...)

Note: Feature engineering queries can be executed Online or Offline. Analytics queries are always executed Offline.

Query Structure for feature engineering

WITH
  -- Entity definitions
  -- Input definitions
  -- External data (generic, reusable)
  -- Filter features
  -- Derived features (composition, enrichment, transformation)
SELECT
  PRIMARY_KEY := BIND_VALUES(...) or BIND_SQL(...),
  feature1,
  feature2...
sql

Query Structure for analytics

WITH
  -- Same as feature engineering
SELECT
  PRIMARY_KEY := BIND_VALUES(...) or BIND_SQL(...),
  feature1,
  feature2
WHERE condition
ORDER BY feature1, feature2...
OFFSET n
LIMIT n;
sql

Notes:

  • feature1 and feature2 can be aggregated features, not just simple features: in that case, a final aggregation will be performed.
  • Some SQL can be added wrapping this query to add more analytics complexity that would be cumbersome in FeatureQL.

Remember

  • Features are functions, not tables
  • Keep EXTERNAL_SQL simple and generic
  • Use EXTEND for enrichment across entities
  • Use TRANSFORM for operations within an entity's arrays
  • ROW() converts scalars to rows; only use when needed
  • Filters are features, not SQL WHERE clauses
  • Arrays of rows are first-class: access fields directly with brackets
Last update at: 2025/12/05 16:03:14
Last updated: 2025-12-05 16:07:55