Aggregate Functions

Overview

Aggregate functions operate on a set of input rows and compute one result per group. The functions on this page are available in Polars SQL.

Except for count(), count_if(), and approx_distinct(), aggregate functions ignore null input values and return null when there are no non-null input values. For example, sum() returns null rather than zero for an empty input group, and avg() does not include null values in the count.

General Aggregate Functions

any_value(x) -> [same as input]()

This is an alias for arbitrary().

arbitrary(x) -> [same as input]()

Returns an arbitrary non-null value of x, if one exists.

avg(x) -> double()

Returns the average (arithmetic mean) of all non-null input values.

SELECT avg(v)
FROM (VALUES (1), (NULL), (3)) AS t(v);
-- 2.000
count(*) -> bigint()

Returns the number of input rows.

SELECT count(*)
FROM (VALUES (1), (NULL), (3)) AS t(v);
-- 3
count(x) -> bigint()

Returns the number of non-null input values.

SELECT count(v)
FROM (VALUES (1), (NULL), (3)) AS t(v);
-- 2
count_if(x) -> bigint()

Returns the number of TRUE input values.

SELECT count_if(flag)
FROM (VALUES (TRUE), (FALSE), (NULL)) AS t(flag);
-- 1
max(x) -> [same as input]()

Returns the maximum non-null input value.

SELECT max(v)
FROM (VALUES (1), (NULL), (3)) AS t(v);
-- 3
min(x) -> [same as input]()

Returns the minimum non-null input value.

SELECT min(v)
FROM (VALUES (1), (NULL), (3)) AS t(v);
-- 1
sum(x) -> [same as input]()

Returns the sum of all non-null input values.

SELECT sum(v)
FROM (VALUES (1), (NULL), (3)) AS t(v);
-- 4

Approximate Aggregate Functions

approx_distinct(x) -> bigint()

Returns the approximate number of distinct non-null input values. This function provides an approximation of count(DISTINCT x). Zero is returned if all input values are null.

SELECT approx_distinct(v)
FROM (VALUES (1), (1), (2), (NULL)) AS t(v);
-- 2

Statistical Aggregate Functions

corr(y, x) -> double()

Returns the correlation coefficient of the non-null input pairs.

SELECT corr(y, x)
FROM (VALUES (1, 2), (3, 4), (5, 6)) AS t(y, x);
-- 1.0
covar_pop(y, x) -> double()

Returns the population covariance of the non-null input pairs.

SELECT covar_pop(y, x)
FROM (VALUES (1, 2), (3, 4), (5, 6)) AS t(y, x);
-- 2.6666666666666665
covar_samp(y, x) -> double()

Returns the sample covariance of the non-null input pairs.

SELECT covar_samp(y, x)
FROM (VALUES (1, 2), (3, 4), (5, 6)) AS t(y, x);
-- 4.0
regr_count(y, x) -> bigint()

Returns the number of input rows where both y and x are non-null. y is the dependent value and x is the independent value.

regr_avgx(y, x) -> double()

Returns the average of the independent value for non-null input pairs.

regr_avgy(y, x) -> double()

Returns the average of the dependent value for non-null input pairs.

regr_sxx(y, x) -> double()

Returns the sum of squares of the independent value for non-null input pairs.

regr_syy(y, x) -> double()

Returns the sum of squares of the dependent value for non-null input pairs.

regr_sxy(y, x) -> double()

Returns the sum of products of the dependent and independent values for non-null input pairs.

regr_slope(y, x) -> double()

Returns the slope of the linear regression line. y is the dependent value and x is the independent value.

regr_intercept(y, x) -> double()

Returns the intercept of the linear regression line. y is the dependent value and x is the independent value.

regr_r2(y, x) -> double()

Returns the coefficient of determination of the linear regression.

SELECT
    regr_count(y, x),
    regr_avgx(y, x),
    regr_avgy(y, x),
    regr_sxx(y, x),
    regr_syy(y, x),
    regr_sxy(y, x),
    regr_slope(y, x),
    regr_intercept(y, x),
    regr_r2(y, x)
FROM (
    VALUES
        (1, 2),
        (3, 4),
        (5, 6),
        (CAST(NULL AS INTEGER), 8),
        (9, CAST(NULL AS INTEGER))
) AS t(y, x);
-- 3, 4.0, 3.0, 8.0, 8.0, 8.0, 1.0, -1.0, 1.0
stddev(x) -> double()

This is an alias for stddev_samp().

stddev_pop(x) -> double()

Returns the population standard deviation of all non-null input values.

stddev_samp(x) -> double()

Returns the sample standard deviation of all non-null input values.

variance(x) -> double()

This is an alias for var_samp().

var_pop(x) -> double()

Returns the population variance of all non-null input values.

var_samp(x) -> double()

Returns the sample variance of all non-null input values.

SELECT
    var_pop(v),
    stddev_pop(v),
    var_samp(v),
    variance(v),
    stddev_samp(v),
    stddev(v)
FROM (VALUES (1), (2), (3)) AS t(v);
-- 0.6666666666666666, 0.816496580927726, 1.0, 1.0, 1.0, 1.0