Simulation functions

Simulation functions generate deterministic pseudo-random values for Monte Carlo simulations, synthetic data generation, and probabilistic modeling. All functions use stable hashing to ensure reproducibility across runs.

Core principle: Deterministic randomness

All simulation functions require a string key that gets hashed to produce consistent "random" values. The same key always produces the same result, enabling:

Reproducible simulations
Distributed computation without coordination
Time travel queries with consistent results
A/B testing with stable assignments

Discrete sampling

SAMPLE_VALUE()

Samples from a discrete distribution with custom probabilities.

SAMPLE_VALUE(
    key: VARCHAR,                    -- Unique identifier for deterministic sampling
    values: ARRAY(T),                -- Possible values to sample from
    weights: ARRAY(NUMERIC)          -- Relative weights (not required to sum to 1)
) -> T

sql

Parameters

key: String that determines the random seed (e.g., user_id + salt)
values: Array of possible outcomes
weights: Relative probabilities (automatically normalized)

Example: Single sampling

Example: Distribution validation

Use cases

User segmentation with custom proportions
Weighted random selection for recommendations
Simulating categorical outcomes with known distributions

Continuous distributions

SAMPLE_UNIFORM()

Samples from a uniform distribution between min and max values.

SAMPLE_UNIFORM(
    key: VARCHAR,
    min_value: DOUBLE,
    max_value: DOUBLE
) -> DOUBLE

sql

Parameters

key: Deterministic seed string
min_value: Lower bound (inclusive)
max_value: Upper bound (exclusive)

Example: Individual samples

Example: Distribution verification

Use cases

Random price variations within a range
Simulating uniform arrival times
Bootstrap sampling for confidence intervals

SAMPLE_GAUSSIAN()

Samples from a truncated Gaussian (normal) distribution.

SAMPLE_GAUSSIAN(
    key: VARCHAR,
    mean: DOUBLE,                    -- Distribution center
    std_dev: DOUBLE,                 -- Standard deviation
    min_value: DOUBLE,               -- Truncation lower bound
    max_value: DOUBLE                -- Truncation upper bound
) -> DOUBLE

sql

Parameters

key: Deterministic seed string
mean: Center of the distribution
std_dev: Standard deviation (spread)
min_value: Minimum possible value (truncation)
max_value: Maximum possible value (truncation)

Example: Individual samples

Example: Bell curve validation

Use cases

Simulating measurement errors
Generating realistic human characteristics (height, weight)
Modeling response times with natural variation

Probabilistic events

PROBABILITY_TO_EVENT()

Converts a probability into a deterministic boolean outcome.

PROBABILITY_TO_EVENT(
    key: VARCHAR,
    probability: DOUBLE              -- Between 0.0 and 1.0
) -> BOOLEAN

sql

Parameters

key: Deterministic seed string
probability: Event probability (0.0 = never, 1.0 = always)

Example: Individual events

Example: Probability validation

Use cases

Conversion modeling with known probabilities
Simulating binary outcomes (churn, purchase, click)
Monte Carlo simulations for risk assessment

Practical patterns

Customer lifetime value simulation

WITH
    CUSTOMER_ID := INPUT(BIGINT),
    -- Simulate purchase probability based on segment
    WILL_PURCHASE := PROBABILITY_TO_EVENT(
        'purchase_' || CUSTOMER_ID::VARCHAR,
        CASE customer_segment
            WHEN 'high_value' THEN 0.8
            WHEN 'medium_value' THEN 0.5
            ELSE 0.2
        END
    ),
    -- Simulate order value if purchasing
    ORDER_VALUE := CASE
        WHEN WILL_PURCHASE THEN
            SAMPLE_GAUSSIAN(
                'value_' || CUSTOMER_ID::VARCHAR,
                100.0,  -- mean
                30.0,   -- std dev
                10.0,   -- min
                500.0   -- max
            )
        ELSE 0.0
    END

sql

A/B test simulation

-- Assign users to variants with specific proportions
VARIANT := SAMPLE_VALUE(
    'test_v1_' || USER_ID::VARCHAR,
    ARRAY['control', 'variant_a', 'variant_b'],
    ARRAY[50, 25, 25]  -- 50% control, 25% A, 25% B
)

sql

Synthetic data generation

-- Generate realistic user attributes
WITH
    USER_ID := BIND_VALUES(SEQUENCE(1, 10000)),
SELECT
    USER_ID,
    AGE := ROUND(SAMPLE_GAUSSIAN(
        'age_' || USER_ID::VARCHAR,
        35.0, 12.0, 18.0, 80.0
    ))::INT,
    INCOME := ROUND(SAMPLE_GAUSSIAN(
        'income_' || USER_ID::VARCHAR,
        75000.0, 25000.0, 20000.0, 250000.0
    )),
    IS_SUBSCRIBER := PROBABILITY_TO_EVENT(
        'sub_' || USER_ID::VARCHAR,
        0.15  -- 15% subscription rate
    )

sql

Monte Carlo risk assessment

-- Simulate 1000 scenarios for project completion
WITH
    SCENARIO_ID := BIND_VALUES(SEQUENCE(1, 1000)),
    -- Each task has uncertain duration
    TASK1_DAYS := SAMPLE_GAUSSIAN(
        'task1_' || SCENARIO_ID::VARCHAR,
        10.0, 2.0, 5.0, 20.0
    ),
    TASK2_DAYS := SAMPLE_GAUSSIAN(
        'task2_' || SCENARIO_ID::VARCHAR,
        15.0, 3.0, 8.0, 30.0
    ),
    TOTAL_DAYS := TASK1_DAYS + TASK2_DAYS
SELECT
    APPROX_PERCENTILE(TOTAL_DAYS, 0.5) AS p50,
    APPROX_PERCENTILE(TOTAL_DAYS, 0.9) AS p90,
    APPROX_PERCENTILE(TOTAL_DAYS, 0.95) AS p95

sql

Best practices

Key design: Always include a unique salt in keys to ensure independence between different simulations
Validation: Test distributions with large samples to verify expected behavior
Truncation: Use reasonable bounds for Gaussian sampling to avoid extreme outliers
Weight normalization: Weights in SAMPLE_VALUE don't need to sum to 1; they're automatically normalized
Reproducibility: Store keys used in simulations for audit and debugging purposes

Suggest changes to this page

Last update at: 2025/12/05 16:03:14