Entity-Centered Modeling

This page explains the Entity-Centered Modeling (ECM) approach and how to implement it with FeatureQL.

What is ECM?

Entity-Centered Modeling (ECM) organizes analytical data around business entities (customer, product, store) rather than transactions.

Each entity gets a wide, enriched view containing its attributes and arrays of related facts, eliminating most joins at query time. This is also named OBT (one Big Table) in the industry.

It's a modeling that is very intuitive for the business user and efficient as long as they dispose of the right tools to query the data.

Core Architecture Pattern

In the silver layer, once tables have been cleaned, normalized, enriched, we can build the ECM tables by aggregating the facts and adding the attributes and arrays of related facts.

-- Silver Layer: ECM tables built from normalized facts

CREATE TABLE silver.customer_obt_from_orders AS
SELECT
    customer_id,
    -- Immutable attributes
    MIN(order_date) as order_first_date,
    -- Mutable attributes with SCD2 tracking
    MAX_BY(country, order_date) as order_last_country,
    -- Arrays of facts with more or less details depending on the needs
    ARRAY_AGG(
        STRUCT(order_id, order_date, amount)
        ORDER BY order_date DESC
    ) AS order_history_sorted,
    $date_ref as ds
FROM orders
GROUP BY customer_id;

-- Silver Layer: ECM tables built from normalized facts
CREATE TABLE silver.customer_obt_from_customers AS
SELECT
    customer_id,
    -- Immutable attributes
    registration_date,
    -- Mutable attributes with SCD2 tracking
    country, segment,
    $date_ref as ds
FROM customers
GROUP BY customer_id;

-- View that joins the two tables
CREATE VIEW silver.customer_obt AS
SELECT * FROM silver.customer_obt_from_customers c
LEFT JOIN silver.customer_obt_from_orders o ON c.customer_id = o.customer_id and c.ds=o.ds;
sql

Smart History Management

Single Current Partition: Keep only the latest state as the primary table, rebuilt incrementally each day by merging new facts with yesterday's state. No wasteful daily snapshots.

SCD Type 2 for Attributes: Track changes to slowly-changing dimensions as an array of rows with value and validity dates. This preserves history without duplication.

Arrays for Point-in-Time: Historical facts stored in arrays enable time-travel without snapshots:

Key Design Principles

Build in Silver, Not Bronze: ECM tables are materialized views in the silver layer, derived from normalized cleaned and conformed tables. This preserves data lineage and enables easy rebuilds.

Incremental Not Full Refresh: After initial build, each day only processes new facts and merges with existing state. Much more efficient than daily full rebuilds.

Use Arrays for Flexibility: Store facts in sorted arrays. This enables both local computations and historical reconstruction without storing multiple snapshots.

Keep Entities Pure: Entity tables should only group by actual IDs. For coarse-grain analysis (such as country or category), use views that flatten and reaggregate:

CREATE VIEW product_category_sales AS
SELECT category, FLATTEN(ARRAY_AGG(sale_history)) as sales
FROM silver.product GROUP BY category
sql

Addressing Common Concerns

ConcernSolution
"Daily snapshots are wasteful"Don't create them. Keep one current partition + SCD2 for changes + arrays for fact history.
"What about updates to historical facts?"Rebuild the array column with ARRAY_AGG GROUP BY. Arrays come from cleaned and conformed tables.
"How do I query historical state?"Filter arrays by date or use SCD2 validity ranges. No snapshot needed.
"This wastes storage on duplication"3-5x storage for facts across entities, but 10-100x compute savings. Worth it.
"Schema evolution is complex"Add new metrics to the aggregation query. Arrays preserve raw history.

Implementation Recipe

  1. Initial Build: Create full ECM table from all historical silver facts
  2. Daily Incremental:
    • Process new facts only
    • Merge with yesterday's state
    • Update SCD2 records for changed attributes
    • Append to sorted fact arrays
  3. Set Array Boundaries: Keep all history (capped to n facts) or limit to recent period based on needs
  4. Monitor and Optimize: Check how columns are used and optimize them

When ECM Shines

  • Self-service analytics: Business users write simple filters instead of complex joins
  • Time-travel queries: Reconstruct any historical state from arrays and SCD2
  • ML features: Derived features are so fast to compute that denormalization is not needed
  • Dashboard performance: Current state loads instantly, history available on-demand

Takeaways

ECM trades storage for simplicity and speed while being smart about history. One current daily partition with SCD2 tracking and fact arrays gives you everything: current state performance, historical analysis capability, and storage efficiency without wasteful daily snapshots. Build incrementally in silver, preserve history intelligently, and deliver easy to use datasets to your users.


Based on concepts formalized by Maxime Beauchemin in 2023 with practical adaptations

Last update at: 2025/10/13 10:23:46