# Window Functions in Redshift

One of the coolest things I learned about in my Redshift journey has been Window Functions.

Although Window functions aren’t a novel feature and exists in other SQL databases, they are a really powerful tool to have in your analysis toolbelt and fits in really well with Redshift.

Like the name suggests, Window Functions let you operate on frame or ‘window’ of data and return a value for each row in that result set.

An easy way to understand this is to think about a spreadsheet. Cells in a spreadsheet can reference arbitrary ranges of other cells relative to themselves and use their values as input to functions.

Window functions work in much the same way, and thus can be super powerful for the same kind of operations you can perform in a spreadsheet and even beyond that.

In this post, I’ll walk you through some of the Window Functions that I’ve found the most useful and hopefully help you gain a better intuition on how they work in general.

The most basic window functions help you access data from ‘previous’ or ‘following’ rows, much like you can do in a spreadsheet.

Two of the most useful window functions are the lead and lag functions.

Let’s say you have a simple table of timeseries data, for instance the closing balance of your stock market account per day.

Now let’s say for each day you would like to calculate your loss/gain over the previous day’s balance.

One way to do this would be something like this:

select
b.date
, b.balance
, b.balance - previous_day.balance as gain
from balances b
left join balances previous_day on dateadd(day, -1, b.date) = previous_day.date
order by date


The lag window function allow us to simplify things, without having to do a self-join.

select
date
, balance
, balance - lag(balance,1) over (
order by date
) as gain
from balances
order by date


The lag function allows us to grab the value of a column in a ‘previous’ row (lead does the opposite, by accessing the next row). We can specify the ordering of rows within the window function in the over(order by date) clause.

One interesting thing to note is that the window function does not rely on the ordering of the query itself. If we were to order differently in the outer query, the lag function would still calculate gain correctly.

Another pair of functions that are pretty useful for looking ahead or backwards are first_value and last_value.

Say you always want to calculate the difference of your daily closing balance with the first balance you had when opening the account, you can use the first_value window function to do this:

select
date
, balance
, balance - first_value(balance) over (
order by date
rows unbounded preceding
) as total_gain
from balances
order by date


This time, we also include a frame clause: rows unbounded preceding. This tells our window function to use all the rows preceding the current row, after sorting on the date field.

## Aggregations

Window functions also lets you do more complex aggregations on frames of data.

Let’s say you also have trades table where you keep track of all your stock trading. It’s pretty simple and it has ts (timestamp), symbol and profit columns.

Now let’s say you want to list out all my trades over time, with a running total of your profit per symbol.

We can use the sum window function to do this:

select
ts
, symbol
, profit
, sum(profit) over (
partition by symbol
order by ts
rows unbounded preceding
) total_profit_for_symbol
order by date


As you can see, we’re using the sum aggregation as a window function to calculate our running total. This is different from a normal sum aggregation you would do in Redshift, because you don’t need to add a group by clause, but instead you’re adding an over clause.

In the over clause, we’re still sorting by date and we also have a frame clause.

We also include a partition by symbol clause. This will essentially ‘filter’ our aggregation to only include rows that have the same value for symbol as our current row.

## Ranking and numbering

Aside from aggregation, Window Functions can also do ranking or numbering of rows. This might not seem especially interesting at first, but once you have this tool at your disposal you’ll find lots of uses for it.

For instance, let’s say you have duplicated data that you want to filter out, but your table has no primary key to that you can use to figure out which records are duplicated.

Let’s say I had a bug in my code and I accidentally adding a bunch of duplicate trades to my trades table and wanted to write a query to clean things up.

This is where the row_number() function can come in very handy.

with dupe_trades as (
select
*
, row_number() over (
partition by ts, symbol, profit
)
order by ts
)
where row_number = 1


Since we’re partitioning by all the columns in the table, any set of rows with duplicate values will have a row_number column, starting at 1 to n, where n is the number of duplicates.

To filter out duplicates, we can just take all the rows with a row_number of 1. This will just give us back the first row out of the set of duplicates.

You might be wondering why I used a CTE here. Well, one gotcha of Window Functions is that they can’t be used in a where clause (the window function is applied to the result set after the where clause is run).

By wrapping the Window Function in a CTE with the with clause, we can tell Redshift that we want to apply the Window Function before filtering.

If we wanted to see how many duplicates we have on average by symbol, we can do this:

with with_row_number as (
select
*
, row_number() over (
partition by ts, symbol, profit
)
order by ts
),
with_number_of_duplicates as (
select
*
, (max(row_number) over (
partition by ts,symbol,profit
) -1)  as number_of_duplicates
from with_row_number
group by 1,2,3,4
)
select
symbol
, sum(number_of_duplicates) as number_of_duplicates_for_symbol
from with_number_of_duplicates
group by 1
order by 2 desc


Here we’re making use of the max aggregation as a window function to get the number of duplicates per set of duplicate values, together with our trick of grouping by all the other columns.

## Avoiding correlated subqueries

Window Functions can be used to avoid correlated subqueries.

A subquery is correlated when it references values from the outer query. Correlated subqueries need to be evaluated for each row of the outer query and can thus be slow.

To illustrate, let’s return back to our trades table. Let’s say I want to list all the trades, and see not only the profit for that trade but how it compares to the average profit for that symbol.

Here’s how you would do it with a correlated subquery.

select
ts
, symbol
, profit
(
select
avg(profit)
where t2.symbol = t1.symbol
) average_profit_for_symbol
order by date


Since the subquery references the symbol column in the outer query, it will need to be evaluated for each row.

If we had a lot of data, this could be slow. Here’s how we can implement this using the avg window function.

select
date
, symbol
, profit
, avg(profit) over (
partition by symbol
) as average_profit_for_symbol
order by date


Ah much cleaner!

## Doing Statistics

Window functions are super useful for calculating standard summary statistics.

Let’s revisit the query I wrote earlier to compare the profit on a trade to the average profit for trades on that symbol. Let’s say instead of the mean, I want to use the median.

To do this, I could use the median window function.

select
date
, symbol
, profit
, median(profit) over (
partition by symbol
) as median_profit_for_symbol
order by date


We could also do more interesting stats like the percentiles and standard deviations.

select
date
, symbol
, profit
, median(profit) over (
partition by symbol
) as median_profit_for_symbol
, stddev_pop(profit) over (
partition by symbol
) as profit_std_dev_for_symbol
, ntile(4) over (
partition by symbol
order by profit
) as profit_ntile_for_symbol
order by date


## Aggregating lists

Finally, you can use window functions as a way to concatenate a list of column values into a single value. The listagg function will group and order a list of rows and then concatenate it’s values into a single string.

Let’s say for a particular trading day we want to not only see the closing balance, but also a list of symbols that we’ve trading that day in a comma separated string.

select
date
, balance
, balance - first_value(balance) over (
order by b.date
rows unbounded preceding
) as total_gain

One thing to be careful of is the the total size of the resulting string cannot be bigger than a varchar(max) which is 65535 characters.