Leverage Entity-Centered Modeling

Entity-Centered Modeling (ECM) organizes analytical data around business entities — customers, products, stores — rather than transactions. Each entity gets a wide, enriched row containing its attributes and arrays of related facts, eliminating most joins at query time. The industry also calls this One Big Table (OBT).

ECM is intuitive for business users and performant for analytical queries, provided you have the right tools to work with nested arrays. FeatureQL's TRANSFORM(), EXTEND(), and array operations are designed exactly for this.

Core architecture

In the silver layer, after tables have been cleaned and normalized, you build ECM tables by aggregating facts per entity:

```sql
-- Aggregate order facts per customer
CREATE TABLE silver.customer_obt_from_orders AS
SELECT
    customer_id,
    MIN(order_date) AS order_first_date,
    MAX_BY(country, order_date) AS order_last_country,
    ARRAY_AGG(
        STRUCT(order_id, order_date, amount)
        ORDER BY order_date DESC
    ) AS order_history_sorted,
    $date_ref AS ds
FROM silver.orders_normalized
GROUP BY customer_id;

-- Customer attributes (already one row per customer in the
-- normalized table, so no GROUP BY is needed)
CREATE TABLE silver.customer_obt_from_customers AS
SELECT
    customer_id,
    registration_date,
    country,
    segment,
    $date_ref AS ds
FROM silver.customers_normalized;

-- Join into a single wide table, listing columns explicitly to avoid
-- duplicating customer_id and ds from both sides
CREATE VIEW silver.customer_obt AS
SELECT
    c.*,
    o.order_first_date,
    o.order_last_country,
    o.order_history_sorted
FROM silver.customer_obt_from_customers c
LEFT JOIN silver.customer_obt_from_orders o
    ON c.customer_id = o.customer_id AND c.ds = o.ds;
```

What goes in fact arrays?

The key insight: be selective. Only include fields that you filter on frequently:

  • IDs for lookup joins (customer_id, product_id, store_id) — integers that compress extremely well
  • Timestamps for filtering (order_date, event_time) — essential for time-based analysis and partition pruning
  • Low-cardinality enums (status, event_type, channel) — compress well and are common filter predicates

Everything else — product names, descriptions, store addresses — stays in dimension tables and gets joined only after filtering. Since you typically filter down to a small fraction of rows (e.g., last 30 days out of a year), the subsequent lookup joins are cheap.
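The filter-then-join pattern is engine-agnostic; here is a minimal sketch in plain Python rather than SQL, with made-up rows and a made-up dimension lookup, just to show why the join stays cheap:

```python
from datetime import date

# Illustrative rows: the wide table carries only IDs, timestamps, and enums;
# product names live in a separate dimension lookup.
facts = [
    {"product_id": 1, "order_date": date(2026, 3, 1), "status": "shipped"},
    {"product_id": 2, "order_date": date(2025, 6, 1), "status": "shipped"},
    {"product_id": 1, "order_date": date(2026, 3, 10), "status": "open"},
]
product_dim = {1: "Widget", 2: "Gadget"}  # dimension table keyed by ID

# 1) Filter first, on the well-compressed timestamp column.
recent = [f for f in facts if f["order_date"] >= date(2026, 2, 17)]

# 2) Join afterwards: the ID lookup touches only the surviving rows.
enriched = [{**f, "product_name": product_dim[f["product_id"]]} for f in recent]
```

Only the rows that survive the date filter ever pay for the dimension lookup, which is the whole argument for keeping wide names out of the fact arrays.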

Avoid computed columns in fact arrays

Don't add attributes that depend on other tables. This introduces circular dependencies and slows down the materialization pipeline.

Smart history management

Single current partition. Keep only the latest state, rebuilt incrementally each day by merging new facts with yesterday's state. No wasteful daily snapshots.
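The daily merge can be sketched in plain Python (shapes are illustrative; only the fact array is shown, and `merge_state` is a hypothetical name, not a FeatureQL function):

```python
from datetime import date

def merge_state(yesterday, new_facts):
    """Merge today's new facts into yesterday's current-state partition."""
    # Copy first, so yesterday's partition stays untouched.
    state = {cid: {"order_history": list(row["order_history"])}
             for cid, row in yesterday.items()}
    for fact in new_facts:
        row = state.setdefault(fact["customer_id"], {"order_history": []})
        row["order_history"].append((fact["order_date"], fact["amount"]))
        # Keep newest-first, mirroring ORDER BY order_date DESC.
        row["order_history"].sort(key=lambda f: f[0], reverse=True)
    return state
```

Each day's run only touches the new facts plus the entities they belong to, which is what makes this cheaper than a full rebuild.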

SCD Type 2 for attributes. Track changes to slowly-changing dimensions as an array of rows with value and validity dates. This preserves history without row duplication.
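The SCD2 bookkeeping itself is small; a sketch in plain Python (the record shape and the `scd2_update` helper are assumptions for illustration):

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel meaning "still valid"

def scd2_update(history, new_value, as_of):
    """Close the open record and append a new one when the value changes."""
    if history and history[-1]["value"] == new_value:
        return history  # unchanged attribute: no new row
    updated = [dict(rec) for rec in history]
    if updated:
        updated[-1]["valid_to"] = as_of  # close the currently open record
    updated.append({"value": new_value, "valid_from": as_of, "valid_to": OPEN_END})
    return updated
```

A row is appended only on an actual change, so stable attributes cost one record no matter how many daily runs pass over them.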

Arrays for point-in-time analysis. Historical facts stored in sorted arrays enable time-travel queries without maintaining separate snapshots.
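A point-in-time read is then just a filter over the array; sketched in plain Python with an illustrative `(date, amount)` fact shape:

```python
from datetime import date

def state_as_of(order_history, as_of):
    """Reconstruct aggregate state at a past date from the fact array."""
    visible = [(d, amount) for d, amount in order_history if d <= as_of]
    return {"order_count": len(visible),
            "total_amount": sum(amount for _, amount in visible)}

# Newest-first, as materialized by ORDER BY order_date DESC.
history = [(date(2026, 3, 2), 5.0), (date(2026, 1, 15), 20.0), (date(2025, 11, 1), 10.0)]
```

The same single current partition answers "what did this customer look like on any past date" without a snapshot per day.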

Key design principles

Build on normalized tables. ECM tables are materialized views derived from cleaned, conformed silver-layer tables. This preserves data lineage and makes rebuilds straightforward.

Increment, don't rebuild. After the initial load, each day processes only new facts and merges them with existing state. Much cheaper than daily full refreshes.

Keep entities pure. Entity tables group by actual entity IDs. For coarse-grain analysis (by country, by category), create views that flatten and reaggregate:

```sql
CREATE VIEW product_category_sales AS
SELECT category, FLATTEN(ARRAY_AGG(sale_history)) AS sales
FROM silver.product
GROUP BY category;
```

Implementation recipe

  1. Initial build — Create the full ECM table from all historical silver facts.
  2. Daily incremental — Process new facts, merge with yesterday's state, update SCD2 records for changed attributes, append to sorted fact arrays.
  3. Set array boundaries — Cap arrays at N facts or limit to a recent time window, depending on query patterns.
  4. Monitor and optimize — Profile which columns are actually queried and trim unused ones.
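Step 3's array boundary can be expressed as a small trim function; a sketch in plain Python (the `cap_array` name and `(date, payload)` shape are illustrative):

```python
from datetime import date

def cap_array(facts, max_len=100, min_date=None):
    """Bound a newest-first fact array by count and/or time window."""
    kept = [f for f in facts if min_date is None or f[0] >= min_date]
    return kept[:max_len]  # newest-first, so truncation keeps recent facts
```

Running this during the daily merge keeps per-entity arrays from growing without bound while preserving the facts queries actually touch.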

Common concerns

| Concern | Answer |
| --- | --- |
| "Daily snapshots are wasteful" | Don't create them. One current partition + SCD2 for changes + arrays for fact history covers all use cases. |
| "Schema evolution is complex" | Rebuild the array column with ARRAY_AGG GROUP BY. Arrays come from cleaned, conformed tables. |
| "How do I query historical state?" | Filter arrays by date or use SCD2 validity ranges. No snapshots needed. |
| "Joins are expensive" | You're joining on pre-filtered data (often <10% of rows). Modern columnar databases handle ID lookups efficiently at that scale. |
| "This duplicates data across entities" | Only IDs and timestamps are duplicated, and they compress extremely well. The storage cost is minimal compared to the compute savings. |

When ECM shines

  • Self-service analytics — Business users write simple filters instead of multi-table joins.
  • Time-travel queries — Reconstruct any historical state from arrays and SCD2 records.
  • ML feature engineering — Derived features compute fast on pre-aggregated arrays without additional denormalization.
  • Dashboard performance — Most charts need only a single table scan, reducing the need for pre-built aggregation tables.

ECM trades modest storage overhead for simplicity and speed. One current partition with SCD2 tracking and fact arrays gives you current-state performance, historical analysis, and storage efficiency — without wasteful daily snapshots.


Based on concepts formalized by Maxime Beauchemin in 2023

Last update at: 2026/03/18 16:18:57