Window Functions in SQL is an OLAP functionality that provides ranking, cumulative computation, and partitioning aggregation. Many commercial RDMBS such like Oracle, MS SQL Server and DB2 have implemented part of this specification, while open source RDMBS including PostgreSQL, MySQL and Firebird doesn't yet. To implement this functionality on PostgreSQL not only helps many users move from those RDBMS to PostgreSQL but encourages OLAP applications such as BI (Business Inteligence) to analyze large data set. This specification is defined first in SQL:2003, and improved in SQL:2008.
The first proposal: http://archives.postgresql.org/pgsql-hackers/2008-06/msg00380.php
The subsequent discussion: http://archives.postgresql.org/pgsql-hackers/2008-07/msg00232.php
The discussion about window function APIs: http://archives.postgresql.org/pgsql-hackers/2008-09/msg00021.php
patch v06: http://umitanuki.net/pgsql/wfv06/window_functions.patch.20081012
patch v06 applied source git: http://git.postgresql.org/git/~davidfetter/window_functions/.git
sample SQL: http://umitanuki.net/pgsql/wfv06/sample.sql
Below is a description of how it is designed in the patch so far.
Below may be dropped features for 8.4.
Lacking these feature, this stage looks compatible to SQLServer 2005, while Oracle and DB2 have almost full of the specification. Nevertheless, the current window function APIs design is accepted by audience, sliding window frame is not difficult to implement for 8.4.
When posted and discussed in -hackers list, a bit confusing was about terminology. So be aware of these definitions.
An expression evaluated in a Window node, which is one of rank function, aggregate function, ntile function, lead or lag function, first or last value function, or nth value function. In a Window node, only TargetEntry that has window expression is evaluated, while other entries are evaluated some outer (scans, joins, aggs) node.
This type of function returns different or the same values row by row. Since this function needs to know and operate "current window", we will need to add new mechanism to PostgreSQL. This includes new spec function such as ROW_NUMBER(), RANK(), DENSE_RANK(), LEAD(), LAG(), etc. In v06, a new attribute pg_proc.proiswfunc is introduced, indicating the function has capability to be called as a (non-aggregate) window function.
The rest part of window expression. This type of function scans tuples along the specified window frame, then returns the same values as long as the window frame doesn't slide. We can use aggregate function we already have and there's no need to add/introduce anything new. In v06, pure aggregate functions are treated as special subset of window functions. See eval_windowaggregate() in nodeWindow.c.
This is a normal aggregate that PostgreSQL has already. "Normal" means "not windowed". In some SQL spec documents, they call it "group aggregate".
Also called a window frame, which is represented in SQL syntax by "ROW BETWEEN...", "RANGE BETWEEN...", "CURRENT ROW...", etc. This range slides row by row in a partitoned window and window functions/aggregates can access any row within this moving frame, thus we need to introduce some mechanism to optimize not to allocate wasting memory.
The sample table is like this.
sample=# SELECT * FROM empsalary;
depname | empno | salary | enroll_date -----------+-------+--------+------------- develop | 10 | 5200 | 2007-08-01 sales | 1 | 5000 | 2006-10-01 personnel | 5 | 3500 | 2007-12-10 sales | 4 | 4800 | 2007-08-08 sales | 6 | 5500 | 2007-01-02 personnel | 2 | 3900 | 2006-12-23 develop | 7 | 4200 | 2008-01-01 develop | 9 | 4500 | 2008-01-01 sales | 3 | 4800 | 2007-08-01 develop | 8 | 6000 | 2006-10-01 develop | 11 | 5200 | 2007-08-15 (11 rows)
Now let's throw a windowed query.
sample=# SELECT sample-# depname, sample-# empno, sample-# salary, sample-# sum(salary) OVER (PARTITION BY depname) sample-# FROM sample-# empsalary;
depname | empno | salary | sum -----------+-------+--------+------- develop | 10 | 5200 | 25100 develop | 7 | 4200 | 25100 develop | 9 | 4500 | 25100 develop | 8 | 6000 | 25100 develop | 11 | 5200 | 25100 personnel | 2 | 3900 | 7400 personnel | 5 | 3500 | 7400 sales | 3 | 4800 | 20100 sales | 1 | 5000 | 20100 sales | 4 | 4800 | 20100 sales | 6 | 5500 | 20100 (11 rows)
You may see dep_sum is the result of SUM() for each depname, and year_sum is the result of SUM() for each enrolling year, without rows aggregated.
The ranking function of window function works as:
sample=# SELECT sample-# depname, sample-# empno, sample-# salary, sample-# rank() OVER (PARTITION BY depname ORDER BY salary) sample-# FROM sample-# empsalary;
depname | empno | salary | rank -----------+-------+--------+------ develop | 7 | 4200 | 1 develop | 9 | 4500 | 2 develop | 10 | 5200 | 3 develop | 11 | 5200 | 3 develop | 8 | 6000 | 5 personnel | 5 | 3500 | 1 personnel | 2 | 3900 | 2 sales | 4 | 4800 | 1 sales | 3 | 4800 | 1 sales | 1 | 5000 | 3 sales | 6 | 5500 | 4 (11 rows)
Another example shows a use in combination with GROUP BY clause.
sample=# SELECT sample=# y, sample=# m, sample=# SUM(SUM(people)) OVER (PARTITION BY y ORDER BY m), sample=# AVG(people) sample=# FROM( sample=# SELECT sample=# EXTRACT(YEAR FROM accident_date) AS y, sample=# EXTRACT(MONTH FROM accident_date) AS m, sample=# * sample=# FROM sample=# accident sample=# )s sample=# GROUP BY y, m;
y | m | sum | avg ------+----+------+-------------------- 2005 | 1 | 1698 | 3.5161290322580645 2005 | 2 | 1698 | 4.8928571428571429 2005 | 3 | 1698 | 4.3870967741935484 2005 | 4 | 1698 | 4.7333333333333333 2005 | 5 | 1698 | 5.0967741935483871 2005 | 6 | 1698 | 5.2666666666666667 2005 | 7 | 1698 | 4.8709677419354839 2005 | 8 | 1698 | 4.7419354838709677 2005 | 9 | 1698 | 4.8000000000000000 2005 | 10 | 1698 | 4.8709677419354839 2005 | 11 | 1698 | 4.1333333333333333 2005 | 12 | 1698 | 4.5483870967741935 2006 | 1 | 1740 | 4.3870967741935484 2006 | 2 | 1740 | 4.5000000000000000 2006 | 3 | 1740 | 4.8387096774193548 2006 | 4 | 1740 | 5.0333333333333333 2006 | 5 | 1740 | 4.4838709677419355 2006 | 6 | 1740 | 4.1333333333333333 2006 | 7 | 1740 | 5.1935483870967742 2006 | 8 | 1740 | 4.7419354838709677 2006 | 9 | 1740 | 3.8333333333333333 2006 | 10 | 1740 | 6.2258064516129032 2006 | 11 | 1740 | 4.4333333333333333 2006 | 12 | 1740 | 5.3225806451612903 (24 rows)
You can put any expressions as window function's arguments or PARTITION BY/ORDER BY clause as long as it satisfies condition that normal aggregate requires.
Now WINDOW clause is shown.
sample=# SELECT depname, empno, salary, sum(salary) OVER w FROM empsalary WINDOW w AS (PARTITION BY depname);
depname | empno | salary | sum -----------+-------+--------+------- develop | 11 | 5200 | 25100 develop | 7 | 4200 | 25100 develop | 9 | 4500 | 25100 develop | 8 | 6000 | 25100 develop | 10 | 5200 | 25100 personnel | 5 | 3500 | 7400 personnel | 2 | 3900 | 7400 sales | 3 | 4800 | 14600 sales | 1 | 5000 | 14600 sales | 4 | 4800 | 14600 (10 rows)
Note that a window definition which is not referred from any function is ignored.
All above is defined in nodeWindow.c temporarily. Maybe it should be moved to backend/utils/adt or somewhere.
These are all in nodeWindow.c
EXPLAIN SELECT sum(salary) OVER (PARTITION BY depname) AS dep_sum ,sum(salary) OVER (PARTITION BY extract(YEAR FROM enroll_date)) AS year_sum ,* FROM empsalary;
QUERY PLAN
-----------------------------------------------------------------------------------------
Window (cost=127.23..129.83 rows=1040 width=48)
-> Sort (cost=127.23..129.83 rows=1040 width=48)
Sort Key: (date_part('year'::text, (enroll_date)::timestamp without time zone))
-> Window (cost=72.52..75.12 rows=1040 width=48)
-> Sort (cost=72.52..75.12 rows=1040 width=48)
Sort Key: depname
-> Seq Scan on empsalary (cost=0.00..20.40 rows=1040 width=48)
This plan is quite ugly, because for each window a Window node is implicitly added with a Sort node. Probably all of window and sort process is packed into a Window node. For this current plan, Sort node uses Tuplesort as you expect then Window node uses Tuplestore to store each Partition tuples. This is supposed to be the worst plan. We are able to get it better somehow.
The final window functions design is described in nodeWindow.c:
* A window function is defined as a function maked as wfunc in pg_proc. By this * mark, it means the function can handle window frame APIs and window iteration APIs * that allow it to access arbitrary random rows within the frame. Window node * can aggregate function as well, treating it as special case. Contrast to * Agg node, aggregate functions are executed one by one from initializing to * finalizing, which suits the Window execution model. The aggregated result * is cached in a WindowStatePerAgg struct and is recycled if the frame didn't * moved since the last execution, otherwise, re-run whole aggregation. This * is not efficient so the aggregate function can be a window function, which * can subtract values from the previous result by which row is dropped since * the last execution. fcinfo->context will be a WindowState instead of AggState * if the aggregate function is called as a window function. * * Note that the solid concept of the window functions is "they can access * arbitrary rows within the frame as they want". As far as we keep the rule * of thumb, any kind of optimization is allowed.
Points are:
Not implemented yet, but with this design we will see FRAME clause and optimized window aggregate soon. Currently window aggregate caches its result unless the frame moves. But once the frame moved, aggregate must be re-aggregate all the tuples. To avoid this inefficiency, we can define aggregates as window functions. The pseudo code is such like:
Datum
count_window(PG_FUNCTION_ARGS)
{
WindowFrame frame = PG_WINDOW_FRAME();
count_context *context = fcinfo->flinfo->fn_extra;
allocate_if_new(context, sizeof(int64));
if (window_moved(frame))
{
for row in dropped:
context->result--;
}
context->result++;
PG_RETURN_INT64(context->result);
}
Ineficiency: While aggregates can be optimized as above, some simple functions like row_number(), rank() are still inefficient in the meaning of storing non-used rows. These functions don't require any following rows from current row to return the result. The row_number() may be treated special case such as min/max in aggregate. Also, some catalog like pg_wfunc can be added to define what type (i.e. which rows of following/preceding/all rows in the frame needed) the window function is. And preceding rows are also not needed in some functions. Maybe cutoff API proposed by Heikki will be needed to make this reality.
/* window function APIs */ #define PG_WINDOW_FRAME() ((WindowFrameData *) ((WindowState *) fcinfo->context)->frame) #define PG_WINDOW_ARG(n) ((ExprState *) (DatumGetPointer(fcinfo->arg[n]))) #define WINDOW_SEEK_CURRENT 0 #define WINDOW_SEEK_HEAD 1 #define WINDOW_SEEK_TAIL 2 /* * private in nodeWindow.c but exposed as window function APIs */ typedef struct WindowFrameData *WindowFrame; typedef struct WindowIterData *WindowIter; /* * window_currentpos * return the position of CURRENT ROW in the frame. */ extern int window_currentpos(WindowFrame frame); /* * window_framerow * return total row number of the frame. */ extern int window_framerow(WindowFrame frame); /* * window_getarg * return the value of the specific argument of the specific row. * relpos can be positive or negative. */ extern Datum window_getarg(WindowFrame frame, ExprState *argstate, int relpos, int seektype, bool *isnull); /* * window_gettuple * set the specified row to slot and return true if succeeded. * relpos can be positive or negative. */ extern bool window_gettuple(WindowFrame frame, TupleTableSlot *slot, int relpos, int seektype); /* * window_moved * If the window frame moved within the current partition. * This cannot be happened unless the frame clause is given. */ extern bool window_moved(WindowFrame frame); /* * window_iter_init * initialize WindowIter from start of the frame. */ extern WindowIter window_iter_init(WindowFrame frame); /* * window_iter_next * advance the iterator to the next row. any tuple fetching is allowed * after this method. */ extern bool window_iter_next(WindowIter iter); /* * window_iter_getarg * get datum from the current iterator row. */ extern Datum window_iter_getarg(WindowIter iter, ExprState *argstate, bool *isnull);
These shown below are *former* ideas about how the window function is made up. Pros & cons from these ideas are considered into the current design.
CREATE AGGREGATE window_func() ( sfunc = ... stype = ... wfunc = ... initcond = ) For each row we would execute the transition function (sfunc) then, if there is a window function (wfunc) then we call that to return a value for this tuple (so in that case we execute two functions per tuple in the window). If wfunc is not set then we return the transition datatype itself. http://archives.postgresql.org/pgsql-hackers/2008-07/msg00236.php
Objection: A window aggregate is same as a grouping aggregate. Also, some of window functions need full scan of rows *before* returning values.
So that would mean we don't provide a mechanism for user-defined windowed aggregate functions at all. Which solves the discussion about how to pass generic info through to them (at least long enough to get the first implementation done). http://archives.postgresql.org/pgsql-hackers/2008-07/msg00239.php
Objection: As mentioned, it hides the definition of functions from external user so that implementation is easier. However, it is odd as other function types is extensible and SQL spec may add more functions later. Some unification seems need.
Just idea, how about pass window object to a function? We'll provide
window operation API then in the function you take window object
through fcinfo:
Datum func(PG_FUNCTION_ARGS)
{
Datum v;
WindowObject w = get_window(fcinfo);
HeapTuple htup_current = window_current_row(w);
HeapTuple htup_prev = window_preceding(w, 1);
/* do something */
PG_RETURN_DATUM(v);
}
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00254.php
Objection: You should consider about the performance. Some optimization mechanism is required.
More valuable discussion about the design starts here: http://archives.postgresql.org/pgsql-hackers/2008-09/msg00021.php
1. Implement Window node, with the capability to invoke an aggregate function, using the above API. Implement required parser/planner changes. Implement a few simple ranking aggregates using the API. 2. Implement glue to call normal aggregates through the new API 3. Implement the optimization to drop rows that are no longer needed (signal_cutoff can be a no-op until this phase) 4. Implement window framing (the frame can always be all rows in the partition, or all rows until the current row, until this phase) 5. Expose the new API to user-defined aggregates. It can be an internal API only used by built-in functions until this phase I believe you already have phase 1 in your patch, except for the API changes
It seems that we must add something like Window object mechanism that represents a window frame, to describe logical window. At the moment there needs to be careful not to cut its performance.
this document is as of 2008/10/12, written by Hitoshi Harada (umi.tanuki@gmail.com)