Simulation functions
FeatureQL's simulation functions generate pseudo-random values that are deterministic — the same inputs always produce the same outputs. This makes Monte Carlo simulations reproducible, synthetic data consistent across runs, and A/B test assignments stable across distributed systems.
Every simulation function takes a string key that gets hashed to produce the "random" value. By varying the key (typically entity ID + a unique salt), you get independent randomness per entity and per simulation.
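FeatureQL's internal hash is not specified in this doc, but the mechanism can be sketched in Python (the SHA-256 choice and the `key_to_unit` helper name are illustrative assumptions, not the engine's actual algorithm):

```python
import hashlib

def key_to_unit(key: str) -> float:
    """Map a string key to a deterministic pseudo-random value in [0, 1)."""
    # First 8 bytes of the SHA-256 digest, read as an unsigned 64-bit integer,
    # scaled into the unit interval. Same key -> same value, every time.
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h / 2**64

# Varying the salt ("hash" vs "other") yields independent streams per entity.
print(key_to_unit("hash42"), key_to_unit("other42"))
```

Because the value depends only on the key, re-running a simulation, or evaluating it on another machine, yields identical draws.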
Sampling from discrete distributions
SAMPLE_VALUE() picks from a set of values with custom weights. The weights don't need to sum to 1 — they're automatically normalized.
`SAMPLE_VALUE(key, values, weights) -> T`

This example samples 10,000 users into four buckets with weights 1:2:3:4. The resulting counts confirm the distribution — roughly 10%, 20%, 30%, and 40%:
```sql
/* SQL */
SELECT VALUE, count(*) AS NUMBER
FROM
  FEATUREQL(
    WITH
      ID := INPUT(BIGINT),
      IDKEY := 'hash' || (ID::VARCHAR),
    SELECT
      ID := BIND_VALUES(SEQUENCE(1,10000)),
      SAMPLE_VALUE(IDKEY, ARRAY['a', 'b', 'c', 'd'], ARRAY[1, 2, 3, 4]) AS VALUE
  )
GROUP BY 1 ORDER BY 1
```

| VALUE VARCHAR | NUMBER BIGINT |
|---|---|
| a | 940 |
| b | 2043 |
| c | 3032 |
| d | 3985 |
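As a concept check, the weight handling can be modeled in Python: scaling a unit-interval draw by the weight total is equivalent to normalizing the weights to sum to 1 (the `key_to_unit` hash helper is an assumed stand-in for FeatureQL's internal hash, not its real algorithm):

```python
import hashlib
from itertools import accumulate

def key_to_unit(key: str) -> float:
    # Illustrative hash stand-in, not FeatureQL's actual one.
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h / 2**64

def sample_value(key, values, weights):
    """Deterministic weighted pick; weights are normalized implicitly."""
    # Scaling the uniform draw by the weight total has the same effect
    # as dividing every weight by that total.
    u = key_to_unit(key) * sum(weights)
    for value, cum in zip(values, accumulate(weights)):
        if u < cum:
            return value
    return values[-1]  # guard against floating-point edge cases

counts = {v: 0 for v in "abcd"}
for i in range(10_000):
    counts[sample_value(f"hash{i}", list("abcd"), [1, 2, 3, 4])] += 1
print(counts)  # roughly 1000 / 2000 / 3000 / 4000
```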
Sampling from continuous distributions
SAMPLE_UNIFORM() draws from a uniform distribution between a min and max value. Every value in the range is equally likely.
`SAMPLE_UNIFORM(key, min_value, max_value) -> DOUBLE`

Here, 10,000 samples between 5 and 15 are bucketed by floor value. Each integer bucket gets roughly 1,000 samples, confirming uniform distribution:
```sql
/* SQL */
SELECT FLOOR(VALUE) AS FLOOR_VALUE, count(*) AS NUMBER
FROM
  FEATUREQL(
    WITH
      ID := INPUT(BIGINT),
      IDKEY := 'hash' || (ID::VARCHAR),
    SELECT
      ID := BIND_VALUES(SEQUENCE(1,10000)),
      SAMPLE_UNIFORM(IDKEY, 5e0, 15e0) AS VALUE
  )
GROUP BY 1 ORDER BY 1
```

| FLOOR_VALUE DOUBLE | NUMBER BIGINT |
|---|---|
| 5.0 | 940 |
| 6.0 | 1026 |
| 7.0 | 1017 |
| 8.0 | 1011 |
| 9.0 | 1035 |
| 10.0 | 986 |
| 11.0 | 980 |
| 12.0 | 997 |
| 13.0 | 1016 |
| 14.0 | 992 |
SAMPLE_GAUSSIAN() draws from a truncated normal distribution — values cluster around the mean and taper off toward the bounds.
`SAMPLE_GAUSSIAN(key, mean, std_dev, min_value, max_value) -> DOUBLE`

This samples 10,000 values from a Gaussian with mean 10, standard deviation 2, truncated to [5, 15]. The bell curve shape is visible in the counts — most values cluster around 10:
```sql
WITH
  ID := INPUT(BIGINT),
  SAMPLE_GAUSSIAN('hash' || (ID::VARCHAR), 10e0, 2e0, 5e0, 15e0) AS VALUE,
  ROUND(VALUE)::BIGINT AS ROUND_VALUE,
  cste1 := 1,
  ID := BIND_VALUES(SEQUENCE(1,10000)),
SELECT
  ROUND_VALUE,
  sum(cste1) GROUP BY ROUND_VALUE AS NUMBER,
ORDER BY ROUND_VALUE;
```

| ROUND_VALUE BIGINT | NUMBER BIGINT |
|---|---|
| 5 | 124 |
| 6 | 270 |
| 7 | 660 |
| 8 | 1191 |
| 9 | 1686 |
| 10 | 2013 |
| 11 | 1769 |
| 12 | 1231 |
| 13 | 642 |
| 14 | 314 |
| 15 | 100 |
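One way to model a deterministic truncated normal in Python is a Box-Muller transform with rejection of out-of-range draws. This is a sketch of the shape only, not FeatureQL's documented method, and the hash helper is an assumed stand-in:

```python
import hashlib
import math

def key_to_unit(key: str) -> float:
    # Illustrative hash stand-in, not FeatureQL's actual one.
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h / 2**64

def sample_gaussian(key, mean, std, lo, hi):
    """Truncated normal: Box-Muller transform, rejecting out-of-range draws."""
    attempt = 0
    while True:
        # Derive two fresh uniforms per attempt by re-salting the key,
        # so rejection stays deterministic for a given key.
        u1 = key_to_unit(f"{key}|u1|{attempt}")
        u2 = key_to_unit(f"{key}|u2|{attempt}")
        z = math.sqrt(-2.0 * math.log(1.0 - u1)) * math.cos(2.0 * math.pi * u2)
        x = mean + std * z
        if lo <= x <= hi:
            return x
        attempt += 1

samples = [sample_gaussian(f"hash{i}", 10.0, 2.0, 5.0, 15.0) for i in range(10_000)]
```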
Converting probabilities to events
SAMPLE_BOOL() turns a probability into a deterministic yes/no outcome. Given a probability of 0.8, roughly 80% of distinct keys will return TRUE.
`SAMPLE_BOOL(key, probability) -> BOOLEAN`

This example tests five probability thresholds across 10 users. Notice that probability 0.0 always returns FALSE and 1.0 always returns TRUE, while intermediate values produce the expected mix:
```sql
WITH
  ID := INPUT(BIGINT),
  IDKEY := 'hash' || (ID::VARCHAR),
SELECT
  ID := BIND_VALUES(SEQUENCE(1,10)),
  SAMPLE_BOOL(IDKEY, 0.00) AS PROBA_000PC,
  SAMPLE_BOOL(IDKEY, 0.05) AS PROBA_005PC,
  SAMPLE_BOOL(IDKEY, 0.50) AS PROBA_050PC,
  SAMPLE_BOOL(IDKEY, 0.80) AS PROBA_080PC,
  SAMPLE_BOOL(IDKEY, 1.00) AS PROBA_100PC,
```

| ID BIGINT | PROBA_000PC BOOLEAN | PROBA_005PC BOOLEAN | PROBA_050PC BOOLEAN | PROBA_080PC BOOLEAN | PROBA_100PC BOOLEAN |
|---|---|---|---|---|---|
| 1 | false | true | true | true | true |
| 2 | false | false | true | true | true |
| 3 | false | false | true | true | true |
| 4 | false | false | false | false | true |
| 5 | false | false | true | true | true |
| 6 | false | false | false | true | true |
| 7 | false | false | true | true | true |
| 8 | false | false | true | true | true |
| 9 | false | false | true | true | true |
| 10 | false | false | true | true | true |
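Conceptually, SAMPLE_BOOL is a threshold test on the key's uniform draw, which is why 0.0 and 1.0 are exact rather than approximate. A Python sketch, with an assumed hash stand-in:

```python
import hashlib

def key_to_unit(key: str) -> float:
    # Illustrative hash stand-in, not FeatureQL's actual one.
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h / 2**64

def sample_bool(key: str, probability: float) -> bool:
    """TRUE when the key's uniform draw falls below the probability."""
    # key_to_unit is in [0, 1), so probability 0.0 can never win
    # and probability 1.0 always does.
    return key_to_unit(key) < probability

flags = [sample_bool(f"hash{i}", 0.8) for i in range(10_000)]
print(sum(flags))  # about 8000 of 10000
```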
To verify at scale, the aggregate version confirms that 5% probability yields ~500 out of 10,000, 50% yields ~5,000, and so on:
```sql
/* SQL */
SELECT
  count_if(PROBA_000PC) AS IS_0,
  count_if(PROBA_005PC) AS ABOUT_500,
  count_if(PROBA_050PC) AS ABOUT_5000,
  count_if(PROBA_080PC) AS ABOUT_8000,
  count_if(PROBA_100PC) AS IS_10000
FROM
  FEATUREQL(
    WITH
      ID := INPUT(BIGINT),
      IDKEY := 'hash' || (ID::VARCHAR),
    SELECT
      ID := BIND_VALUES(SEQUENCE(1,10000)),
      SAMPLE_BOOL(IDKEY, 0.00) AS PROBA_000PC,
      SAMPLE_BOOL(IDKEY, 0.05) AS PROBA_005PC,
      SAMPLE_BOOL(IDKEY, 0.50) AS PROBA_050PC,
      SAMPLE_BOOL(IDKEY, 0.80) AS PROBA_080PC,
      SAMPLE_BOOL(IDKEY, 1.00) AS PROBA_100PC,
  )
```

| IS_0 BIGINT | ABOUT_500 BIGINT | ABOUT_5000 BIGINT | ABOUT_8000 BIGINT | IS_10000 BIGINT |
|---|---|---|---|---|
| 0 | 468 | 5029 | 7992 | 10000 |
Practical patterns
Synthetic data generation — combine multiple distributions to create realistic test datasets:
```sql
WITH
  USER_ID := BIND_VALUES(SEQUENCE(1, 10000)),
SELECT
  USER_ID,
  AGE := ROUND(SAMPLE_GAUSSIAN('age_' || USER_ID::VARCHAR, 35e0, 12e0, 18e0, 80e0))::BIGINT,
  INCOME := ROUND(SAMPLE_GAUSSIAN('income_' || USER_ID::VARCHAR, 75000e0, 25000e0, 20000e0, 250000e0)),
  IS_SUBSCRIBER := SAMPLE_BOOL('sub_' || USER_ID::VARCHAR, 0.15e0)
```

Monte Carlo risk assessment — simulate thousands of scenarios to estimate outcome distributions:
```sql
WITH
  SCENARIO_ID := BIND_VALUES(SEQUENCE(1, 1000)),
  TASK1_DAYS := SAMPLE_GAUSSIAN('task1_' || SCENARIO_ID::VARCHAR, 10e0, 2e0, 5e0, 20e0),
  TASK2_DAYS := SAMPLE_GAUSSIAN('task2_' || SCENARIO_ID::VARCHAR, 15e0, 3e0, 8e0, 30e0),
  TOTAL_DAYS := TASK1_DAYS + TASK2_DAYS
SELECT
  APPROX_PERCENTILE(TOTAL_DAYS, 0.5e0) AS p50,
  APPROX_PERCENTILE(TOTAL_DAYS, 0.9e0) AS p90,
  APPROX_PERCENTILE(TOTAL_DAYS, 0.95e0) AS p95
```

Best practices
- Always use a salt in keys to ensure independence between simulations: `'purchase_' || USER_ID::VARCHAR` and `'churn_' || USER_ID::VARCHAR` produce independent outcomes for the same user.
- Validate distributions by running large samples (10,000+) and checking that the shape matches expectations, as shown in the examples above.
- Use reasonable truncation bounds for Gaussian sampling — extreme outliers can distort downstream calculations.
- Store the keys used in production simulations so you can reproduce and audit results.
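The independence claim in the first bullet can be checked empirically. In the illustrative Python model used above (an assumed SHA-256 stand-in for FeatureQL's hash), two differently salted streams agree only about as often as chance predicts:

```python
import hashlib

def key_to_unit(key: str) -> float:
    # Illustrative hash stand-in, not FeatureQL's actual one.
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h / 2**64

def sample_bool(key: str, probability: float) -> bool:
    return key_to_unit(key) < probability

purchase = [sample_bool(f"purchase_{i}", 0.5) for i in range(10_000)]
churn = [sample_bool(f"churn_{i}", 0.5) for i in range(10_000)]

# Two independent fair coins agree ~50% of the time; correlated
# streams (e.g. a reused unsalted key) would agree far more often.
agreement = sum(p == c for p, c in zip(purchase, churn))
print(agreement)  # near 5000: the salts decorrelate the streams
```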