Data Cookbook Kitchen

Intro

A while ago, I wrote a blog post about quantiles in Druid. I am planning a similar exercise for ClickHouse; this post is a preparatory exercise about generating the test data.

Back then, I wrote an external data generator script, but I thought to myself: with the richness of ClickHouse, can this be done natively?

Let’s find out!

Recap: the Python version

Here is the Python code that I used to generate the data for the Druid version:

import time
import random
import json


def bimodal(m1, m2, s, p):

    if random.random() <= p:
        m = m1
    else:
        m = m2
    return random.gauss(m, s)


def main():
 
    distr = {
        'A': { 'fn': random.gauss,       'param': (10000, 2000) },
        'B': { 'fn': random.gauss,       'param': (10000, 10000) },
        'C': { 'fn': random.expovariate, 'param': (0.0001,) },
        'D': { 'fn': bimodal,            'param': (10000, 50000, 2000, 0.8) },
    }
    for i in range(0, 10000):
        grp = random.choices('ABCD', cum_weights=(0.50, 0.75, 0.90, 1.00), k=1)[0]
        rec = {
            'tn': time.time() - 10000 + i,
            'cn': grp,
            'rn': distr[grp]['fn'](*distr[grp]['param'])
        }
        print(json.dumps(rec))

if __name__ == "__main__":
    main()

For each data point, it selects one of four segments, not with equal probability but according to a list of cumulative weights. Each segment has its own random distribution of income values, so there are four probability density functions:

  • two normal distributions with the same mean but different standard deviations
  • one exponential distribution
  • one bimodal distribution, the weighted sum of two normal distributions.

Segment selection

The first step is to select the segment. I used the random.choices function to make a weighted selection (sampling with replacement).

Since I used the cum_weights parameter form of the function, let’s implement

random.choices(items, cum_weights, k)

as an SQL UDF!
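As a reminder of the behavior we need to replicate, here is a small Python sketch of random.choices with cumulative weights, using the segment letters and weights from the generator above (the seed and sample size are arbitrary choices for illustration):

```python
import random

# The four segments and their cumulative weights from the generator above.
items = ['A', 'B', 'C', 'D']
cum_weights = [0.50, 0.75, 0.90, 1.00]

random.seed(42)  # fixed seed so the sketch is reproducible
sample = random.choices(items, cum_weights=cum_weights, k=100_000)

# With these weights, roughly half of the picks should be 'A',
# a quarter 'B', 15% 'C', and 10% 'D'.
freq = {g: sample.count(g) / len(sample) for g in items}
print(freq)
```

This is the weighted selection (with replacement) that the SQL UDF below has to reproduce.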

Creating an array of random numbers

This should be simple, right?

SELECT arrayMap(x -> randCanonical(), range(5));

but it turns out ClickHouse is a bit too smart here; the result is:

[0.047622851931395545,0.047622851931395545,0.047622851931395545,0.047622851931395545,0.047622851931395545]

so the calls to randCanonical get optimized into one!

Luckily, there is a simple way to prevent this optimization by adding an optional parameter (this trick is described in the documentation). So let’s write instead:

SELECT arrayMap(x -> randCanonical(x), range(5));

and this indeed returns 5 different values!

If (like in the Python version) we don’t want to require the cumulative weights to be normalized to 1, we can use randUniform(min, max[, x]) instead of randCanonical([x]), with the same optional parameter.

Creating the randomChoice function

From the function signature above, here’s the idea for implementing the randomChoice function:

  1. Use the logic from above to generate an array of random numbers between 0 and the last value of the cum_weights array (taking advantage of addressing the last array element using index -1): arrayMap(kk -> randUniform(0.0, cum_weights[-1], kk), range(k))
  2. For each element, find the element in items whose position corresponds to the first element in cum_weights that is greater than that random number. For this, the arrayFirst function comes in handy. Its full signature is arrayFirst(func(x[, y1, ..., yN]), source_arr[, cond1_arr, ... , condN_arr]): it evaluates func while iterating simultaneously through all the arrays you pass in, taking one element from each, and returns the element of source_arr for which func evaluates to true for the first time. So, let's pass items and cum_weights into a lambda expression: x -> arrayFirst((i, c) -> c > x, items, cum_weights), where x binds to one of the random values we got from step 1, and then
  3. we iterate over the random array and put it all together: CREATE OR REPLACE FUNCTION randomChoice AS (items, cum_weights, k) -> arrayMap( x -> arrayFirst((i, c) -> c > x, items, cum_weights), arrayMap(kk -> randUniform(0.0, cum_weights[-1], kk), range(k)) );
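Before wiring this up in SQL, the same first-greater scan can be sketched in Python to check that it reproduces the weighted selection. The function name random_choice mirrors the UDF, but this is only an illustrative model, not the ClickHouse code:

```python
import random

def random_choice(items, cum_weights, k):
    """Mimic the SQL UDF: for each of k uniform draws in
    [0, cum_weights[-1]), pick the item at the position of the
    first cumulative weight greater than the draw."""
    out = []
    for _ in range(k):
        x = random.uniform(0.0, cum_weights[-1])
        # Equivalent of arrayFirst((i, c) -> c > x, items, cum_weights);
        # fall back to the last item in the (theoretical) edge case x == cum_weights[-1].
        out.append(next((i for i, c in zip(items, cum_weights) if c > x), items[-1]))
    return out

random.seed(1)  # fixed seed so the sketch is reproducible
sample = random_choice(['A', 'B', 'C', 'D'], [0.50, 0.75, 0.90, 1.00], 100_000)
freq = {g: sample.count(g) / len(sample) for g in 'ABCD'}
print(freq)
```

The observed frequencies land close to the intended 50/25/15/10 split, which is exactly what the SQL version should produce as well.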

Generating the data sample table

Random numbers

We have four different distribution functions:

  • two different instances of the normal distribution with different standard deviations, for which we can use the built-in function randNormal
  • one exponential distribution; this is also built in, as randExponential
  • one bimodal distribution, which can be expressed as the weighted sum of two normal distributions like so:
CREATE OR REPLACE FUNCTION randBimodal AS (m1, m2, s, p, x) ->
    if (randCanonical(x) <= p,
        randNormal(m1, s, x),
        randNormal(m2, s, x)
    );

The meaning of the parameters is clear; the trailing x is passed through to the extra parameter of randNormal as described above. We’ll see why that’s needed in a moment.

The first attempt to generate the table of segments could look like this:

SELECT
    randomChoice(['A', 'B', 'C', 'D'], [0.50, 0.75, 0.90, 1.00], 1)[1]
FROM numbers(10);

But unfortunately this query populates all rows with the same value. ClickHouse has optimized the query to call randomChoice only once! Let’s try something different:

SELECT arrayJoin(randomChoice(['A', 'B', 'C', 'D'], [0.50, 0.75, 0.90, 1.00], 10));

This one works. It populates an array with randomly chosen letters and unnests the array into rows with a single letter each.

Segments

Let’s try to generate the numbers for the different salary segments. The first attempt looks like this:

SELECT 
    arrayJoin(randomChoice(['A', 'B', 'C', 'D'], [0.50, 0.75, 0.90, 1.00], 10)) AS grp,
    multiIf(
        grp = 'A', randNormal(10000, 2000),
        grp = 'B', randNormal(10000, 10000),
        grp = 'C', randExponential(0.0001),
        grp = 'D', randBimodal(10000, 50000, 2000, 0.8, 1),
        0.0
    ) AS salary;

Again, this doesn’t work as intended:

    ┌─grp─┬─────────────salary─┐
 1. │ C   │  3993.518662535571 │
 2. │ A   │ 12801.202689072426 │
 3. │ C   │  3993.518662535571 │
 4. │ D   │ 11670.909517805492 │
 5. │ A   │ 12801.202689072426 │
 6. │ A   │ 12801.202689072426 │
 7. │ B   │ 6322.6538239287565 │
 8. │ A   │ 12801.202689072426 │
 9. │ A   │ 12801.202689072426 │
10. │ A   │ 12801.202689072426 │
    └─────┴────────────────────┘

We get only one value per segment, which is repeated for every member of that segment. Maybe we can use the extra parameter, and inject it with some window function construction?

SELECT 
    arrayJoin(randomChoice(['A', 'B', 'C', 'D'], [0.50, 0.75, 0.90, 1.00], 10)) AS grp,
    row_number() OVER(PARTITION BY () ROWS UNBOUNDED PRECEDING) AS x,
    multiIf(
        grp = 'A', randNormal(10000, 2000, x),
        grp = 'B', randNormal(10000, 10000, x),
        grp = 'C', randExponential(0.0001, x),
        grp = 'D', randBimodal(10000, 50000, 2000, 0.8, x),
        0.0
    ) AS salary;

But alas, no improvement here:

    ┌─grp─┬─x─┬─────────────salary─┐
 1. │ B   │ 1 │  18379.09711361375 │
 2. │ A   │ 1 │ 12255.087812819587 │
 3. │ C   │ 1 │  2717.391640891984 │
 4. │ B   │ 1 │  18379.09711361375 │
 5. │ B   │ 1 │  18379.09711361375 │
 6. │ C   │ 1 │  2717.391640891984 │
 7. │ A   │ 1 │ 12255.087812819587 │
 8. │ D   │ 1 │ 12255.087812819587 │
 9. │ C   │ 1 │  2717.391640891984 │
10. │ A   │ 1 │ 12255.087812819587 │
    └─────┴───┴────────────────────┘

Here’s what actually works:

Instead of the arrayJoin function, we will use an ARRAY JOIN clause. That clause can unnest not only one, but multiple arrays (of equal size), which are iterated over simultaneously! The first array will be our randomly generated segment identifiers; the second one will be a sequence of consecutive numbers that give us our extra parameter to force a new call of the random functions each time:

SELECT 
    grp, 
    ind,
    multiIf(
        grp = 'A', randNormal(10000, 2000, ind),
        grp = 'B', randNormal(10000, 10000, ind),
        grp = 'C', randExponential(0.0001, ind),
        grp = 'D', randBimodal(10000, 50000, 2000, 0.8, ind),
        0.0
    ) AS salary   
FROM numbers(1)
ARRAY JOIN 
    randomChoice(['A', 'B', 'C', 'D'], [0.50, 0.75, 0.90, 1.00], 10) AS grp, 
    range(10) AS ind;

The aliases for the array elements have to be included explicitly in the SELECT list; they are not included in SELECT *.

The result looks much better:

    ┌─grp─┬─ind─┬─────────────salary─┐
 1. │ B   │   0 │ 25847.638155210378 │
 2. │ A   │   1 │ 10343.193715818396 │
 3. │ C   │   2 │  7317.574891941754 │
 4. │ A   │   3 │  12984.64975644605 │
 5. │ A   │   4 │  9723.019069980526 │
 6. │ B   │   5 │  4534.790179741936 │
 7. │ B   │   6 │   9568.73635817865 │
 8. │ D   │   7 │  10018.12918325706 │
 9. │ A   │   8 │  9159.250024384684 │
10. │ B   │   9 │ 25739.895010664135 │
    └─────┴─────┴────────────────────┘

Goodie: show the results as a histogram

For a quick visual check of the generated distributions, we can use the histogram aggregator along with the bar function. Here it is for the bimodal distribution:

WITH hist AS
    (
        SELECT arrayJoin(histogram(10)(randBimodal(10000, 50000, 2000, 0.8, 1))) AS rec
        FROM numbers(1000000)
    )
SELECT
    rec.1,
    bar(rec.3, 0, 300000)
FROM hist

    ┌─tupleElement(rec, 1)─┬─bar(tupleElement(rec, 3), 0, 300000)─────────────────────────────────────────────┐
 1. │    385.5981471535233 │                                                                                  │
 2. │    4939.583032776782 │ ████████████▎                                                                    │
 3. │    7242.759950531312 │ ████████████████████████████████████████████████████████████████████████████████ │
 4. │    9874.654483671295 │ ████████████████████████████████████████████████████████████████████████████████ │
 5. │    12872.97248317489 │ ██████████████▍                                                                  │
 6. │    29185.63036295282 │                                                                                  │
 7. │    45152.40311480819 │ ███                                                                              │
 8. │   47255.662774237884 │ █████████████████████▊                                                           │
 9. │   49858.626015518545 │ ████████████████████████▊                                                        │
10. │   53256.125310489326 │ ███▋                                                                             │
    └──────────────────────┴──────────────────────────────────────────────────────────────────────────────────┘

Note that the buckets for the histogram are not equally spaced: if you want equal spacing, you need to define the bucketing and aggregation yourself.
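If you do want equal-width buckets, the bucketing itself is easy to define: truncate each value to the left edge of its bucket and count per edge. A minimal Python sketch (the bucket width of 5000 and the sample size are arbitrary choices for illustration):

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def bimodal(m1, m2, s, p):
    """Same bimodal sampler as in the original generator."""
    return random.gauss(m1 if random.random() <= p else m2, s)

width = 5_000  # arbitrary fixed bucket width for illustration
n = 100_000
counts = {}
for _ in range(n):
    v = bimodal(10_000, 50_000, 2_000, 0.8)
    bucket = int(v // width) * width  # left edge of the bucket
    counts[bucket] = counts.get(bucket, 0) + 1

# Crude text bars, one row per equally spaced bucket.
for edge in sorted(counts):
    print(f"{edge:>8} {'#' * (counts[edge] * 200 // n)}")
```

In ClickHouse, the same truncation can be done with intDiv plus a GROUP BY; the point here is only that equal spacing is a one-line bucketing formula rather than a property of histogram().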

Conclusion

  • With a bit of UDF magic, we eliminated the need for an external data generator script.
  • Watch out for calls to random functions being optimized away. All random functions come with an optional extra parameter that can be used to force a fresh evaluation for each row.
  • ARRAY JOIN can join against more than one array: in that case, all arrays are iterated over simultaneously.
  • The histogram and bar functions allow for a quick visual check of results.

This image is taken from page 377 of “Praktisches Kochbuch für die gewöhnliche und feinere Küche” by Medical Heritage Library, Inc., and is licensed under CC BY-NC-SA 2.0.