Simulation functions

Simulation functions generate deterministic pseudo-random values for Monte Carlo simulations, synthetic data generation, and probabilistic modeling. All functions use stable hashing to ensure reproducibility across runs.

Core principle: Deterministic randomness

All simulation functions require a string key that gets hashed to produce consistent "random" values. The same key always produces the same result, enabling:

  • Reproducible simulations
  • Distributed computation without coordination
  • Time travel queries with consistent results
  • A/B testing with stable assignments

Discrete sampling

SAMPLE_VALUE()

Samples from a discrete distribution with custom probabilities.

SAMPLE_VALUE(
    key: VARCHAR,                    -- Unique identifier for deterministic sampling
    values: ARRAY(T),                -- Possible values to sample from
    weights: ARRAY(NUMERIC)          -- Relative weights (not required to sum to 1)
) -> T
sql

Parameters

  • key: String that determines the random seed (e.g., user_id + salt)
  • values: Array of possible outcomes
  • weights: Relative probabilities (automatically normalized)

Example: Single sampling

Example: Distribution validation

Use cases

  • User segmentation with custom proportions
  • Weighted random selection for recommendations
  • Simulating categorical outcomes with known distributions

Continuous distributions

SAMPLE_UNIFORM()

Samples from a uniform distribution between min and max values.

SAMPLE_UNIFORM(
    key: VARCHAR,
    min_value: DOUBLE,
    max_value: DOUBLE
) -> DOUBLE
sql

Parameters

  • key: Deterministic seed string
  • min_value: Lower bound (inclusive)
  • max_value: Upper bound (exclusive)

Example: Individual samples

Example: Distribution verification

Use cases

  • Random price variations within a range
  • Simulating uniform arrival times
  • Bootstrap sampling for confidence intervals

SAMPLE_GAUSSIAN()

Samples from a truncated Gaussian (normal) distribution.

SAMPLE_GAUSSIAN(
    key: VARCHAR,
    mean: DOUBLE,                    -- Distribution center
    std_dev: DOUBLE,                 -- Standard deviation
    min_value: DOUBLE,               -- Truncation lower bound
    max_value: DOUBLE                -- Truncation upper bound
) -> DOUBLE
sql

Parameters

  • key: Deterministic seed string
  • mean: Center of the distribution
  • std_dev: Standard deviation (spread)
  • min_value: Minimum possible value (truncation)
  • max_value: Maximum possible value (truncation)

Example: Individual samples

Example: Bell curve validation

Use cases

  • Simulating measurement errors
  • Generating realistic human characteristics (height, weight)
  • Modeling response times with natural variation

Probabilistic events

PROBABILITY_TO_EVENT()

Converts a probability into a deterministic boolean outcome.

PROBABILITY_TO_EVENT(
    key: VARCHAR,
    probability: DOUBLE              -- Between 0.0 and 1.0
) -> BOOLEAN
sql

Parameters

  • key: Deterministic seed string
  • probability: Event probability (0.0 = never, 1.0 = always)

Example: Individual events

Example: Probability validation

Use cases

  • Conversion modeling with known probabilities
  • Simulating binary outcomes (churn, purchase, click)
  • Monte Carlo simulations for risk assessment

Practical patterns

Customer lifetime value simulation

WITH
    CUSTOMER_ID := INPUT(BIGINT),
    -- Simulate purchase probability based on segment
    WILL_PURCHASE := PROBABILITY_TO_EVENT(
        'purchase_' || CUSTOMER_ID::VARCHAR,
        CASE customer_segment
            WHEN 'high_value' THEN 0.8
            WHEN 'medium_value' THEN 0.5
            ELSE 0.2
        END
    ),
    -- Simulate order value if purchasing
    ORDER_VALUE := CASE
        WHEN WILL_PURCHASE THEN
            SAMPLE_GAUSSIAN(
                'value_' || CUSTOMER_ID::VARCHAR,
                100.0,  -- mean
                30.0,   -- std dev
                10.0,   -- min
                500.0   -- max
            )
        ELSE 0.0
    END
sql

A/B test simulation

-- Assign users to variants with specific proportions
VARIANT := SAMPLE_VALUE(
    'test_v1_' || USER_ID::VARCHAR,
    ARRAY['control', 'variant_a', 'variant_b'],
    ARRAY[50, 25, 25]  -- 50% control, 25% A, 25% B
)
sql

Synthetic data generation

-- Generate realistic user attributes
WITH
    USER_ID := BIND_TABLE(SEQUENCE(1, 10000)),
SELECT
    USER_ID,
    AGE := ROUND(SAMPLE_GAUSSIAN(
        'age_' || USER_ID::VARCHAR,
        35.0, 12.0, 18.0, 80.0
    ))::INT,
    INCOME := ROUND(SAMPLE_GAUSSIAN(
        'income_' || USER_ID::VARCHAR,
        75000.0, 25000.0, 20000.0, 250000.0
    )),
    IS_SUBSCRIBER := PROBABILITY_TO_EVENT(
        'sub_' || USER_ID::VARCHAR,
        0.15  -- 15% subscription rate
    )
sql

Monte Carlo risk assessment

-- Simulate 1000 scenarios for project completion
WITH
    SCENARIO_ID := BIND_TABLE(SEQUENCE(1, 1000)),
    -- Each task has uncertain duration
    TASK1_DAYS := SAMPLE_GAUSSIAN(
        'task1_' || SCENARIO_ID::VARCHAR,
        10.0, 2.0, 5.0, 20.0
    ),
    TASK2_DAYS := SAMPLE_GAUSSIAN(
        'task2_' || SCENARIO_ID::VARCHAR,
        15.0, 3.0, 8.0, 30.0
    ),
    TOTAL_DAYS := TASK1_DAYS + TASK2_DAYS
SELECT
    APPROX_PERCENTILE(TOTAL_DAYS, 0.5) AS p50,
    APPROX_PERCENTILE(TOTAL_DAYS, 0.9) AS p90,
    APPROX_PERCENTILE(TOTAL_DAYS, 0.95) AS p95
sql

Best practices

  1. Key design: Always include a unique salt in keys to ensure independence between different simulations
  2. Validation: Test distributions with large samples to verify expected behavior
  3. Truncation: Use reasonable bounds for Gaussian sampling to avoid extreme outliers
  4. Weight normalization: Weights in SAMPLE_VALUE don't need to sum to 1; they're automatically normalized
  5. Reproducibility: Store keys used in simulations for audit and debugging purposes
Last update at: 2025/10/13 10:23:46