Simulation functions

FeatureQL's simulation functions generate pseudo-random values that are deterministic — the same inputs always produce the same outputs. This makes Monte Carlo simulations reproducible, synthetic data consistent across runs, and A/B test assignments stable across distributed systems.

Every simulation function takes a string key that gets hashed to produce the "random" value. By varying the key (typically entity ID + a unique salt), you get independent randomness per entity and per simulation.
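The mechanism can be sketched in Python. SHA-256 is an assumption standing in for FeatureQL's internal hash, which is not documented; the point is only that the output depends solely on the key:

```python
import hashlib

def key_to_uniform(key: str) -> float:
    """Map a string key to a deterministic pseudo-random value in [0, 1).

    Illustrative only: SHA-256 stands in for FeatureQL's internal hash.
    Because the value depends only on the key, every rerun and every
    machine computes the same "random" number.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, scale to [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64

# Same key, same draw; different keys, independent-looking draws.
assert key_to_uniform("hash42") == key_to_uniform("hash42")
assert key_to_uniform("hash42") != key_to_uniform("hash43")
```

All the sampling functions below can be seen as transformations of such a key-derived uniform value.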

Sampling from discrete distributions

SAMPLE_VALUE() picks from a set of values with custom weights. The weights don't need to sum to 1 — they're automatically normalized.

SAMPLE_VALUE(key, values, weights) -> T

This example samples 10,000 users into four buckets with weights 1:2:3:4. The resulting counts confirm the distribution — roughly 10%, 20%, 30%, and 40%:

FeatureQL
/* SQL */
SELECT VALUE, count(*) AS NUMBER
FROM
FEATUREQL(
    WITH
        ID := INPUT(BIGINT),
        IDKEY := 'hash' || (ID::VARCHAR),
    SELECT
        ID := BIND_VALUES(SEQUENCE(1,10000)),
        SAMPLE_VALUE(IDKEY, ARRAY['a', 'b', 'c', 'd'], ARRAY[1, 2, 3, 4]) AS VALUE
)
GROUP BY 1 ORDER BY 1
Result
VALUE (VARCHAR)    NUMBER (BIGINT)
a                  940
b                  2043
c                  3032
d                  3985
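The normalization behavior can be sketched in Python: the uniform draw is scaled to the weight total, so the weights never need to sum to 1. SHA-256 is an assumption standing in for FeatureQL's internal hash:

```python
import hashlib

def sample_value(key, values, weights):
    """Weighted pick that normalizes weights implicitly, mirroring
    SAMPLE_VALUE's documented behavior. Illustrative sketch only."""
    n = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    # Scale the uniform draw to [0, sum(weights)) instead of dividing
    # each weight by the total -- same effect, fewer operations.
    u = (n / 2**64) * sum(weights)
    cumulative = 0.0
    for value, weight in zip(values, weights):
        cumulative += weight
        if u < cumulative:
            return value
    return values[-1]  # guard against float rounding at the top edge
```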

Sampling from continuous distributions

SAMPLE_UNIFORM() draws from a uniform distribution between a min and max value. Every value in the range is equally likely.

SAMPLE_UNIFORM(key, min_value, max_value) -> DOUBLE

Here, 10,000 samples between 5 and 15 are bucketed by floor value. Each integer bucket gets roughly 1,000 samples, confirming uniform distribution:

FeatureQL
/* SQL */
SELECT FLOOR(VALUE) AS FLOOR_VALUE, count(*) AS NUMBER
FROM
FEATUREQL(
    WITH
        ID := INPUT(BIGINT),
        IDKEY := 'hash' || (ID::VARCHAR),
    SELECT
        ID := BIND_VALUES(SEQUENCE(1,10000)),
        SAMPLE_UNIFORM(IDKEY, 5e0, 15e0) AS VALUE
)
GROUP BY 1 ORDER BY 1
Result
FLOOR_VALUE (DOUBLE)    NUMBER (BIGINT)
5.0                     940
6.0                     1026
7.0                     1017
8.0                     1011
9.0                     1035
10.0                    986
11.0                    980
12.0                    997
13.0                    1016
14.0                    992
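A uniform draw over an arbitrary range is just the unit draw rescaled. A minimal Python sketch, with SHA-256 assumed as a stand-in for FeatureQL's internal hash:

```python
import hashlib

def sample_uniform(key, lo, hi):
    """Deterministic uniform draw in [lo, hi): rescale a hash-derived
    uniform value from [0, 1). Illustrative sketch only."""
    n = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return lo + (n / 2**64) * (hi - lo)
```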

SAMPLE_GAUSSIAN() draws from a truncated normal distribution — values cluster around the mean and taper off toward the bounds.

SAMPLE_GAUSSIAN(key, mean, std_dev, min_value, max_value) -> DOUBLE

This samples 10,000 values from a Gaussian with mean 10, standard deviation 2, truncated to [5, 15]. The bell curve shape is visible in the counts — most values cluster around 10:

FeatureQL
WITH
    ID := INPUT(BIGINT),
    SAMPLE_GAUSSIAN('hash' || (ID::VARCHAR), 10e0, 2e0, 5e0, 15e0) AS VALUE,
    ROUND(VALUE)::BIGINT AS ROUND_VALUE,
    CSTE1 := 1,
    ID := BIND_VALUES(SEQUENCE(1,10000)),
SELECT
    ROUND_VALUE,
    sum(CSTE1) GROUP BY ROUND_VALUE AS NUMBER,
ORDER BY ROUND_VALUE;
Result
ROUND_VALUE (BIGINT)    NUMBER (BIGINT)
5                       124
6                       270
7                       660
8                       1191
9                       1686
10                      2013
11                      1769
12                      1231
13                      642
14                      314
15                      100
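One way to implement a truncated normal is the inverse-CDF method: squeeze the uniform draw into the CDF mass between the bounds, then invert. This is a sketch of that technique, not necessarily FeatureQL's actual algorithm (which is not documented); SHA-256 again stands in for the internal hash:

```python
import hashlib
from statistics import NormalDist

def key_to_uniform(key):
    n = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return n / 2**64

def sample_gaussian(key, mean, std_dev, lo, hi):
    """Truncated normal via the inverse-CDF method: map the uniform
    draw into [CDF(lo), CDF(hi)), then invert the CDF."""
    dist = NormalDist(mean, std_dev)
    cdf_lo, cdf_hi = dist.cdf(lo), dist.cdf(hi)
    u = key_to_uniform(key)
    return dist.inv_cdf(cdf_lo + u * (cdf_hi - cdf_lo))
```

Because the CDF is steepest near the mean, most of the squeezed mass maps close to it, which produces the bell-shaped counts shown above.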

Converting probabilities to events

SAMPLE_BOOL() turns a probability into a deterministic yes/no outcome. Given a probability of 0.8, roughly 80% of distinct keys will return TRUE.

SAMPLE_BOOL(key, probability) -> BOOLEAN

This example tests five probability thresholds across 10 users. Notice that probability 0.0 always returns FALSE and 1.0 always returns TRUE, while intermediate values produce the expected mix:

FeatureQL
WITH
    ID := INPUT(BIGINT),
    IDKEY := 'hash' || (ID::VARCHAR),
SELECT
    ID := BIND_VALUES(SEQUENCE(1,10)),
    SAMPLE_BOOL(IDKEY, 0.00) AS PROBA_000PC,
    SAMPLE_BOOL(IDKEY, 0.05) AS PROBA_005PC,
    SAMPLE_BOOL(IDKEY, 0.50) AS PROBA_050PC,
    SAMPLE_BOOL(IDKEY, 0.80) AS PROBA_080PC,
    SAMPLE_BOOL(IDKEY, 1.00) AS PROBA_100PC,
Result
ID (BIGINT)    PROBA_000PC    PROBA_005PC    PROBA_050PC    PROBA_080PC    PROBA_100PC
1              false          true           true           true           true
2              false          false          true           true           true
3              false          false          true           true           true
4              false          false          false          false          true
5              false          false          true           true           true
6              false          false          false          true           true
7              false          false          true           true           true
8              false          false          true           true           true
9              false          false          true           true           true
10             false          false          true           true           true
(all PROBA_* columns are BOOLEAN)

To verify at scale, the aggregate version confirms that 5% probability yields ~500 out of 10,000, 50% yields ~5,000, and so on:

FeatureQL
/* SQL */
SELECT
    count_if(PROBA_000PC) AS IS_0,
    count_if(PROBA_005PC) AS ABOUT_500,
    count_if(PROBA_050PC) AS ABOUT_5000,
    count_if(PROBA_080PC) AS ABOUT_8000,
    count_if(PROBA_100PC) AS IS_10000
FROM
FEATUREQL(
    WITH
        ID := INPUT(BIGINT),
        IDKEY := 'hash' || (ID::VARCHAR),
    SELECT
        ID := BIND_VALUES(SEQUENCE(1,10000)),
        SAMPLE_BOOL(IDKEY, 0.00) AS PROBA_000PC,
        SAMPLE_BOOL(IDKEY, 0.05) AS PROBA_005PC,
        SAMPLE_BOOL(IDKEY, 0.50) AS PROBA_050PC,
        SAMPLE_BOOL(IDKEY, 0.80) AS PROBA_080PC,
        SAMPLE_BOOL(IDKEY, 1.00) AS PROBA_100PC,
)
Result
IS_0 (BIGINT)    ABOUT_500 (BIGINT)    ABOUT_5000 (BIGINT)    ABOUT_8000 (BIGINT)    IS_10000 (BIGINT)
0                468                   5029                   7992                   10000
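Conceptually, SAMPLE_BOOL is a threshold test on the key's uniform value. A Python sketch of the idea, with SHA-256 assumed as a stand-in for FeatureQL's internal hash:

```python
import hashlib

def sample_bool(key, probability):
    """Deterministic Bernoulli draw: true iff the key's uniform value
    falls below the probability. Illustrative sketch only."""
    n = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return n / 2**64 < probability

# The uniform value lies in [0, 1), so probability 0.0 can never be
# satisfied and probability 1.0 always is.
```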

Practical patterns

Synthetic data generation — combine multiple distributions to create realistic test datasets:

WITH
    USER_ID := BIND_VALUES(SEQUENCE(1, 10000)),
SELECT
    USER_ID,
    AGE := ROUND(SAMPLE_GAUSSIAN('age_' || USER_ID::VARCHAR, 35e0, 12e0, 18e0, 80e0))::BIGINT,
    INCOME := ROUND(SAMPLE_GAUSSIAN('income_' || USER_ID::VARCHAR, 75000e0, 25000e0, 20000e0, 250000e0)),
    IS_SUBSCRIBER := SAMPLE_BOOL('sub_' || USER_ID::VARCHAR, 0.15e0)

Monte Carlo risk assessment — simulate thousands of scenarios to estimate outcome distributions:

WITH
    SCENARIO_ID := BIND_VALUES(SEQUENCE(1, 1000)),
    TASK1_DAYS := SAMPLE_GAUSSIAN('task1_' || SCENARIO_ID::VARCHAR, 10e0, 2e0, 5e0, 20e0),
    TASK2_DAYS := SAMPLE_GAUSSIAN('task2_' || SCENARIO_ID::VARCHAR, 15e0, 3e0, 8e0, 30e0),
    TOTAL_DAYS := TASK1_DAYS + TASK2_DAYS
SELECT
    APPROX_PERCENTILE(TOTAL_DAYS, 0.5e0) AS p50,
    APPROX_PERCENTILE(TOTAL_DAYS, 0.9e0) AS p90,
    APPROX_PERCENTILE(TOTAL_DAYS, 0.95e0) AS p95

Best practices

  • Always use a salt in keys to ensure independence between simulations. 'purchase_' || USER_ID::VARCHAR and 'churn_' || USER_ID::VARCHAR produce independent outcomes for the same user.
  • Validate distributions by running large samples (10,000+) and checking the shape matches expectations, as shown in the examples above.
  • Use reasonable truncation bounds for Gaussian sampling — extreme outliers can distort downstream calculations.
  • Store the keys used in production simulations so you can reproduce and audit results.
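The salting advice can be checked directly: under the hash-based model sketched earlier (SHA-256 assumed as a stand-in for FeatureQL's internal hash), two salted keys for the same entity produce unrelated draws:

```python
import hashlib

def key_to_uniform(key):
    n = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return n / 2**64

# Salted keys for the same user hash to unrelated values: the purchase
# draw tells you nothing about the churn draw.
uid = "12345"
purchase_u = key_to_uniform("purchase_" + uid)
churn_u = key_to_uniform("churn_" + uid)
```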
Last update at: 2026/03/03 16:47:38