One of the coolest things I learned about in my Redshift journey has been Window Functions.
Although Window functions aren’t a novel feature and exists in other SQL databases, they are a really powerful tool to have in your analysis toolbelt and fits in really well with Redshift.
Like the name suggests, Window Functions let you operate on frame or ‘window’ of data and return a value for each row in that result set.
An easy way to understand this is to think about a spreadsheet. Cells in a spreadsheet can reference arbitrary ranges of other cells relative to themselves and use their values as input to functions.
Window functions work in much the same way, and thus can be super powerful for the same kind of operations you can perform in a spreadsheet and even beyond that.
In this post, I’ll walk you through some of the Window Functions that I’ve found the most useful and hopefully help you gain a better intuition on how they work in general.
Looking ahead or behind
The most basic window functions help you access data from ‘previous’ or ‘following’ rows, much like you can do in a spreadsheet.
Two of the most useful window functions are the lead
and lag
functions.
Let’s say you have a simple table of timeseries data, for instance the closing balance of your stock market account per day.
Now let’s say for each day you would like to calculate your loss/gain over the previous day’s balance.
One way to do this would be something like this:
select
b.date
, b.balance
, b.balance - previous_day.balance as gain
from balances b
left join balances previous_day on dateadd(day, -1, b.date) = previous_day.date
order by date
The lag
window function allow us to simplify things, without having to do a self-join.
select
date
, balance
, balance - lag(balance,1) over (
order by date
) as gain
from balances
order by date
The lag
function allows us to grab the value of a column in a ‘previous’ row (lead
does the opposite, by accessing the next row). We can specify the ordering of rows within the window function in the over(order by date)
clause.
One interesting thing to note is that the window function does not rely on the ordering of the query itself. If we were to order differently in the outer query, the lag
function would still calculate gain
correctly.
Another pair of functions that are pretty useful for looking ahead or backwards are first_value
and last_value
.
Say you always want to calculate the difference of your daily closing balance with the first balance you had when opening the account, you can use the first_value
window function to do this:
select
date
, balance
, balance - first_value(balance) over (
order by date
rows unbounded preceding
) as total_gain
from balances
order by date
This time, we also include a frame clause: rows unbounded preceding
. This tells our window function to use all the rows preceding the current row, after sorting on the date
field.
Aggregations
Window functions also lets you do more complex aggregations on frames of data.
Let’s say you also have trades
table where you keep track of all your stock trading. It’s pretty simple and it has ts
(timestamp), symbol
and profit
columns.
Now let’s say you want to list out all my trades over time, with a running total of your profit per symbol.
We can use the sum
window function to do this:
select
ts
, symbol
, profit
, sum(profit) over (
partition by symbol
order by ts
rows unbounded preceding
) total_profit_for_symbol
from trades
order by date
As you can see, we’re using the sum
aggregation as a window function to calculate our running total. This is different from a normal sum
aggregation you would do in Redshift, because you don’t need to add a group by
clause, but instead you’re adding an over
clause.
In the over
clause, we’re still sorting by date and we also have a frame clause.
We also include a partition by symbol
clause. This will essentially ‘filter’ our aggregation to only include rows that have the same value for symbol
as our current row.
Ranking and numbering
Aside from aggregation, Window Functions can also do ranking or numbering of rows. This might not seem especially interesting at first, but once you have this tool at your disposal you’ll find lots of uses for it.
For instance, let’s say you have duplicated data that you want to filter out, but your table has no primary key to that you can use to figure out which records are duplicated.
Let’s say I had a bug in my code and I accidentally adding a bunch of duplicate trades to my trades
table and wanted to write a query to clean things up.
This is where the row_number()
function can come in very handy.
with dupe_trades as (
select
*
, row_number() over (
partition by ts, symbol, profit
)
from trades
order by ts
)
select * from dupe_trades
where row_number = 1
Since we’re partitioning by all the columns in the table, any set of rows with duplicate values will have a row_number
column, starting at 1
to n
, where n
is the number of duplicates.
To filter out duplicates, we can just take all the rows with a row_number
of 1
. This will just give us back the first row out of the set of duplicates.
You might be wondering why I used a CTE here. Well, one gotcha of Window Functions is that they can’t be used in a where
clause (the window function is applied to the result set after the where clause is run).
By wrapping the Window Function in a CTE with the with
clause, we can tell Redshift that we want to apply the Window Function before filtering.
If we wanted to see how many duplicates we have on average by symbol, we can do this:
with with_row_number as (
select
*
, row_number() over (
partition by ts, symbol, profit
)
from trades
order by ts
),
with_number_of_duplicates as (
select
*
, (max(row_number) over (
partition by ts,symbol,profit
) -1) as number_of_duplicates
from with_row_number
group by 1,2,3,4
)
select
symbol
, sum(number_of_duplicates) as number_of_duplicates_for_symbol
from with_number_of_duplicates
group by 1
order by 2 desc
Here we’re making use of the max
aggregation as a window function to get the number of duplicates per set of duplicate values, together with our trick of grouping by all the other columns.
Avoiding correlated subqueries
Window Functions can be used to avoid correlated subqueries.
A subquery is correlated when it references values from the outer query. Correlated subqueries need to be evaluated for each row of the outer query and can thus be slow.
To illustrate, let’s return back to our trades
table. Let’s say I want to list all the trades, and see not only the profit for that trade but how it compares to the average profit for that symbol.
Here’s how you would do it with a correlated subquery.
select
ts
, symbol
, profit
(
select
avg(profit)
from trades t2
where t2.symbol = t1.symbol
) average_profit_for_symbol
from trades t1
order by date
Since the subquery references the symbol
column in the outer query, it will need to be evaluated for each row.
If we had a lot of data, this could be slow. Here’s how we can implement this using the avg
window function.
select
date
, symbol
, profit
, avg(profit) over (
partition by symbol
) as average_profit_for_symbol
from trades
order by date
Ah much cleaner!
Doing Statistics
Window functions are super useful for calculating standard summary statistics.
Let’s revisit the query I wrote earlier to compare the profit on a trade to the average profit for trades on that symbol. Let’s say instead of the mean, I want to use the median.
To do this, I could use the median
window function.
select
date
, symbol
, profit
, median(profit) over (
partition by symbol
) as median_profit_for_symbol
from trades
order by date
We could also do more interesting stats like the percentiles and standard deviations.
select
date
, symbol
, profit
, median(profit) over (
partition by symbol
) as median_profit_for_symbol
, stddev_pop(profit) over (
partition by symbol
) as profit_std_dev_for_symbol
, ntile(4) over (
partition by symbol
order by profit
) as profit_ntile_for_symbol
from trades
order by date
Aggregating lists
Finally, you can use window functions as a way to concatenate a list of column values into a single value. The listagg
function will group and order a list of rows and then concatenate it’s values into a single string.
Let’s say for a particular trading day we want to not only see the closing balance, but also a list of symbols that we’ve trading that day in a comma separated string.
select
date
, balance
, balance - first_value(balance) over (
order by b.date
rows unbounded preceding
) as total_gain
, listagg(symbol, ', ') symbols_traded
from balances
left join trades on date(balances.date) = date(trades.ts)
group by 1,2
order by date
One thing to be careful of is the the total size of the resulting string cannot be bigger than a varchar(max)
which is 65535 characters.
That’s a wrap!
That concludes our short tour of window functions in Redshift. My hope is that I have convinced you that they can be a powerful addition to your analysis tool belt and this gives you a good start to go explore more on your own and learn more about all the other window functions that you can use in Redshift!