One of the coolest things I learned about in my Redshift journey has been Window Functions.
Although Window functions aren’t a novel feature and exists in other SQL databases, they are a really powerful tool to have in your analysis toolbelt and fits in really well with Redshift.
Like the name suggests, Window Functions let you operate on frame or ‘window’ of data and return a value for each row in that result set.
An easy way to understand this is to think about a spreadsheet. Cells in a spreadsheet can reference arbitrary ranges of other cells relative to themselves and use their values as input to functions.
Window functions work in much the same way, and thus can be super powerful for the same kind of operations you can perform in a spreadsheet and even beyond that.
In this post, I’ll walk you through some of the Window Functions that I’ve found the most useful and hopefully help you gain a better intuition on how they work in general.
Looking ahead or behind
The most basic window functions help you access data from ‘previous’ or ‘following’ rows, much like you can do in a spreadsheet.
Two of the most useful window functions are the
Let’s say you have a simple table of timeseries data, for instance the closing balance of your stock market account per day.
Now let’s say for each day you would like to calculate your loss/gain over the previous day’s balance.
One way to do this would be something like this:
select b.date , b.balance , b.balance - previous_day.balance as gain from balances b left join balances previous_day on dateadd(day, -1, b.date) = previous_day.date order by date
lag window function allow us to simplify things, without having to do a self-join.
select date , balance , balance - lag(balance,1) over ( order by date ) as gain from balances order by date
lag function allows us to grab the value of a column in a ‘previous’ row (
lead does the opposite, by accessing the next row). We can specify the ordering of rows within the window function in the
over(order by date) clause.
One interesting thing to note is that the window function does not rely on the ordering of the query itself. If we were to order differently in the outer query, the
lag function would still calculate
Another pair of functions that are pretty useful for looking ahead or backwards are
Say you always want to calculate the difference of your daily closing balance with the first balance you had when opening the account, you can use the
first_value window function to do this:
select date , balance , balance - first_value(balance) over ( order by date rows unbounded preceding ) as total_gain from balances order by date
This time, we also include a frame clause:
rows unbounded preceding. This tells our window function to use all the rows preceding the current row, after sorting on the
Window functions also lets you do more complex aggregations on frames of data.
Let’s say you also have
trades table where you keep track of all your stock trading. It’s pretty simple and it has
Now let’s say you want to list out all my trades over time, with a running total of your profit per symbol.
We can use the
sum window function to do this:
select ts , symbol , profit , sum(profit) over ( partition by symbol order by ts rows unbounded preceding ) total_profit_for_symbol from trades order by date
As you can see, we’re using the
sum aggregation as a window function to calculate our running total. This is different from a normal
sum aggregation you would do in Redshift, because you don’t need to add a
group by clause, but instead you’re adding an
over clause, we’re still sorting by date and we also have a frame clause.
We also include a
partition by symbol clause. This will essentially ‘filter’ our aggregation to only include rows that have the same value for
symbol as our current row.
Ranking and numbering
Aside from aggregation, Window Functions can also do ranking or numbering of rows. This might not seem especially interesting at first, but once you have this tool at your disposal you’ll find lots of uses for it.
For instance, let’s say you have duplicated data that you want to filter out, but your table has no primary key to that you can use to figure out which records are duplicated.
Let’s say I had a bug in my code and I accidentally adding a bunch of duplicate trades to my
trades table and wanted to write a query to clean things up.
This is where the
row_number() function can come in very handy.
with dupe_trades as ( select * , row_number() over ( partition by ts, symbol, profit ) from trades order by ts ) select * from dupe_trades where row_number = 1
Since we’re partitioning by all the columns in the table, any set of rows with duplicate values will have a
row_number column, starting at
n is the number of duplicates.
To filter out duplicates, we can just take all the rows with a
1. This will just give us back the first row out of the set of duplicates.
You might be wondering why I used a CTE here. Well, one gotcha of Window Functions is that they can’t be used in a
where clause (the window function is applied to the result set after the where clause is run).
By wrapping the Window Function in a CTE with the
with clause, we can tell Redshift that we want to apply the Window Function before filtering.
If we wanted to see how many duplicates we have on average by symbol, we can do this:
with with_row_number as ( select * , row_number() over ( partition by ts, symbol, profit ) from trades order by ts ), with_number_of_duplicates as ( select * , (max(row_number) over ( partition by ts,symbol,profit ) -1) as number_of_duplicates from with_row_number group by 1,2,3,4 ) select symbol , sum(number_of_duplicates) as number_of_duplicates_for_symbol from with_number_of_duplicates group by 1 order by 2 desc
Here we’re making use of the
max aggregation as a window function to get the number of duplicates per set of duplicate values, together with our trick of grouping by all the other columns.
Avoiding correlated subqueries
Window Functions can be used to avoid correlated subqueries.
A subquery is correlated when it references values from the outer query. Correlated subqueries need to be evaluated for each row of the outer query and can thus be slow.
To illustrate, let’s return back to our
trades table. Let’s say I want to list all the trades, and see not only the profit for that trade but how it compares to the average profit for that symbol.
Here’s how you would do it with a correlated subquery.
select ts , symbol , profit ( select avg(profit) from trades t2 where t2.symbol = t1.symbol ) average_profit_for_symbol from trades t1 order by date
Since the subquery references the
symbol column in the outer query, it will need to be evaluated for each row.
If we had a lot of data, this could be slow. Here’s how we can implement this using the
avg window function.
select date , symbol , profit , avg(profit) over ( partition by symbol ) as average_profit_for_symbol from trades order by date
Ah much cleaner!
Window functions are super useful for calculating standard summary statistics.
Let’s revisit the query I wrote earlier to compare the profit on a trade to the average profit for trades on that symbol. Let’s say instead of the mean, I want to use the median.
To do this, I could use the
median window function.
select date , symbol , profit , median(profit) over ( partition by symbol ) as median_profit_for_symbol from trades order by date
We could also do more interesting stats like the percentiles and standard deviations.
select date , symbol , profit , median(profit) over ( partition by symbol ) as median_profit_for_symbol , stddev_pop(profit) over ( partition by symbol ) as profit_std_dev_for_symbol , ntile(4) over ( partition by symbol order by profit ) as profit_ntile_for_symbol from trades order by date
Finally, you can use window functions as a way to concatenate a list of column values into a single value. The
listagg function will group and order a list of rows and then concatenate it’s values into a single string.
Let’s say for a particular trading day we want to not only see the closing balance, but also a list of symbols that we’ve trading that day in a comma separated string.
select date , balance , balance - first_value(balance) over ( order by b.date rows unbounded preceding ) as total_gain , listagg(symbol, ', ') symbols_traded from balances left join trades on date(balances.date) = date(trades.ts) group by 1,2 order by date
One thing to be careful of is the the total size of the resulting string cannot be bigger than a
varchar(max) which is 65535 characters.
That’s a wrap!
That concludes our short tour of window functions in Redshift. My hope is that I have convinced you that they can be a powerful addition to your analysis tool belt and this gives you a good start to go explore more on your own and learn more about all the other window functions that you can use in Redshift!