Back to blog

The Cohort Retention SQL Query I Actually Use

Spencer Pauly
Spencer Pauly
3 min read
The Cohort Retention SQL Query I Actually Use

Retention is the metric people most want and least want to write the SQL for. The chart is simple to picture: group users by the month they signed up, then track how many came back in each later month. Writing it from scratch every time is annoying enough that most people just don't, which is why so few teams actually look at retention.

Here's the query I keep around, with the parts explained so you can bend it to your tables.

The shape of the problem

You need two facts about each user: when they first showed up, and which months they were active. Almost every schema has these somewhere. Signup is usually a created_at on a users table. Activity is usually a timestamp on whatever event means "this person used the product" — an order, a session, a login.

Two CTEs, then a join. That's the whole thing.

with cohorts as (
  select
    id as user_id,
    date_trunc('month', created_at) as cohort_month
  from users
),
activity as (
  select
    user_id,
    date_trunc('month', occurred_at) as active_month
  from events
  group by 1, 2
)
select
  c.cohort_month,
  -- months since signup: 0, 1, 2, ...
  (extract(year from a.active_month) - extract(year from c.cohort_month)) * 12
    + (extract(month from a.active_month) - extract(month from c.cohort_month)) as month_number,
  count(distinct c.user_id) as active_users
from cohorts c
join activity a on a.user_id = c.user_id
where a.active_month >= c.cohort_month
group by 1, 2
order by 1, 2;

That gives you a long table: for each cohort, how many users were active 0, 1, 2, … months after signing up. Month 0 is your cohort size.

Turning counts into percentages

Raw counts are hard to compare across cohorts of different sizes. You want the percentage that came back. Wrap the result and divide by each cohort's month-0 count:

with retention as (
  -- the query above goes here, as a CTE
  ...
),
cohort_size as (
  select cohort_month, active_users as size
  from retention
  where month_number = 0
)
select
  r.cohort_month,
  r.month_number,
  r.active_users,
  round(100.0 * r.active_users / s.size, 1) as pct_retained
from retention r
join cohort_size s on s.cohort_month = r.cohort_month
order by r.cohort_month, r.month_number;

Now pct_retained is the number you actually plot. Month 0 is always 100%. The shape of the dropoff after that tells you whether your product sticks.

The two mistakes I always see

The first is counting activity that includes month 0 as if it were retention. Of course people are active the month they signed up. That's not retention, that's the denominator. Month 1 onward is the real story.

The second is using a "last seen" timestamp instead of monthly activity buckets. If you only track each user's most recent activity, you can't reconstruct which months they were present. You need the event-level data, bucketed by month. If your events table is huge, this query will be slow, so run it against a read replica or materialize the monthly activity rollup once a day.

Or just ask for it

I'll be honest about why this query lives in a snippet file and not in my head: I write it maybe once a quarter, and every time I have to re-derive the month_number arithmetic. These days I mostly just ask in plain English and read the SQL it produces to check the join. That's the workflow QueryBear is built around, but the query above works fine pasted straight into psql. Keep it somewhere. Future you will want it the next time someone asks "is retention getting better or worse" and expects an answer before the meeting ends.

Database Access

Give Your AI Agents
Database Access. Securely.

Connect any database. Control permissions. Audit every query. All running locally on your machine.