Problem

How to get index scan on indexed timestamp column in postgres alongside a predicate?

Solution

Ensure usage of functions which are tagged as STABLE or IMMUTATBLE in query predicate and also explicitly specify the predicate in case of partial index

Case Study

Consider the following table named User. (Can directly jump to explanation section)

User
+ id (INT)
+ username (TEXT)
+ created_at (TIMESTAMPTZ)

At some point the table saw exponential growth in inserts so it was decided to index created_at column but since only data from frequent time series was required (let’s say last one month), it made sense to just partially index the table. Table was partially index using the following query

CREATE INDEX partial_idx_user_created_at_2021 ON public."User" USING BTREE ("created_at") WHERE (("created_at" >= '2021-01-01 00:00:00+00'::TIMESTAMPTZ))

Now, our motive is to get count of users created in past 1 month, let’s try query plan generated for different queries and try to get the best possible plan (probably a index scan in our case)

Attempt 1

EXPLAIN SELECT 
	COUNT(id) 
FROM 
	"User" 
WHERE
	"created_at"::DATE BETWEEN 
	(CURRENT_DATE + INTERVAL '0 HOURS')::TIMESTAMPTZ - INTERVAL '1 MONTH' AND
  (CURRENT_DATE + INTERVAL '0 HOURS')::TIMESTAMPTZ;

-- output
Finalize Aggregate  (cost=8213439.22..8213439.23 rows=1 width=8)
  ->  Gather  (cost=8213439.01..8213439.22 rows=2 width=8)
	Workers Planned: 2
	->  Partial Aggregate  (cost=8212439.01..8212439.02 rows=1 width=8)
		->  Parallel Seq Scan on "User"  (cost=0.00..8210763.50 rows=670203 width=4)
				Filter: ((("created_at")::date <= ((CURRENT_DATE + '00:00:00'::interval))::timestamp with time zone) AND
					(("created_at")::date >= (((CURRENT_DATE + '00:00:00'::interval))::timestamp with time zone - '1 mon'::interval)))

we are getting a sequential scan makes sense, postgres will go through the entire table and pluck out the rows where the column satisfy the condition.

We need to force it to utilize the partial_idx_user_created_at_2021 partial index collection, we will do so by explicitly specifying the predicate from the partial index in the query, and check the query generated.

Attempt 2

EXPLAIN SELECT 
	COUNT(id) 
FROM 
	"User" 
WHERE
	"created_at" >= '2021-01-01 00:00:00+00'::TIMESTAMPTZ -- PREDICATE FROM PARTIAL INDEX
	AND 
		"created_at"::DATE BETWEEN (CURRENT_DATE + INTERVAL '0 HOURS')::TIMESTAMPTZ - INTERVAL '1 MONTH' AND
	  (CURRENT_DATE + INTERVAL '0 HOURS')::TIMESTAMPTZ;

-- output
Finalize Aggregate  (cost=8547479.33..8547479.34 rows=1 width=8)
->  Gather  (cost=8547479.12..8547479.32 rows=2 width=8)
	Workers Planned: 2
	->  Partial Aggregate  (cost=8546479.12..8546479.12 rows=1 width=8)
		->  Parallel Seq Scan on "User"  (cost=0.00..8545865.00 rows=245646 width=4)
				Filter: (("createdAt" >= '2021-01-01 00:00:00+00'::timestamp with time zone) AND 
				(("created_at")::date <= ((CURRENT_DATE + '00:00:00'::interval))::timestamp with time zone) AND 
				(("created_at")::date >= (((CURRENT_DATE + '00:00:00'::interval))::timestamp with time zone - '1 mon'::interval)))

Still getting sequential scan? We specified the predicate from partial index, and that definitely made into the query planner but it’s still giving us sequential scan! After pondering and googling about it a bit, I came across the following doc https://www.postgresql.org/docs/8.3/xfunc-volatility.html , let’s make a small tweak in the query and try again.

Final Attempt

EXPLAIN SELECT 
	COUNT(id) 
FROM 
	"User" 
WHERE
	"created_at"::TIMESTAMPTZ >= '2021-01-01 00:00:00+00'::TIMESTAMPTZ -- casting to timestampz(actual type)
	AND 
		"created_at"::DATE BETWEEN (CURRENT_DATE + INTERVAL '0 HOURS')::TIMESTAMPTZ - INTERVAL '1 MONTH' AND
	  (CURRENT_DATE + INTERVAL '0 HOURS')::TIMESTAMPTZ;

-- output
Aggregate  (cost=8.61..8.62 rows=1 width=8)
->  Index Scan using partial_idx_user_created_at_2021 on "User"  (cost=0.58..8.61 rows=1 width=4)
  	Index Cond: (("created_at" >= (((CURRENT_DATE + '00:00:00'::interval))::timestamp with time zone - '1 mon'::interval)) AND ("created_at" <= ((CURRENT_DATE + '00:00:00'::interval))::timestamp with time zone))

And, finally we get a index scan with predicate condition. So what changed? We see the planner no longer mentions the predicate from partial_idx_user_created_at_2021 instead it just uses that partial index. Meaning it decided it was safe to use. Well our mistake was(I have made similar mistake in prod and hence writing this blog), casting created_at to DATE doing so was an example of a VOLATILE function, and whenever such a function is used in a query planner as a conservative approach postgres doesn’t refer to the index.

Explanation

An index can be used, when the indexed column(s) is/are compared to constants (literal values), function calls, which are marked at least STABLE (which means that within a single statement, multiple calls of the functions, with same parameters, will produce the same results), or/and IMMUATABLE (function cannot modify the database and is guaranteed to return the same results given the same arguments forever) or combination of those.

In our case study, partial index was on created_at of type TIMESTAMP WITH TIMEZONE

casting created_at to date was an example of potentially VOLATILE function since their is a possible case for the server’s timezone to change (timezone offset specified from column) hence yielding inconsistent results.